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Preface 



The AMDO 2000 workshop took place at the Universitat de les Hies Balears 
(UIB) on 7-9 September 2000, sponsored by the International Association for 
Pattern Recognition Technical Committee, the European Commission by Hu- 
man Potential Program: High Level Scientific Conferences and the Mathematics 
and Computer Science Department of UIB. The subject of the workshop was on- 
going research in articulated motion on the sequence of images and sophisticated 
models for deformable objects. The goals of these areas are to understand and 
interpret object motion around complex objects that we can find in sequences 
of images in the real world. These topics (geometry and physics of deformable 
models, motion analysis, articulated models and animation, visualization of de- 
formable models, 3D recovery from motion, single or multiple human motion 
analysis and synthesis, applications of deformable models and motion analysis, 
etc.) are interesting examples of how research can be used to solve more general 
problems. Another objective of this workshop was to relate fields using com- 
puter graphics, computer animation or applications in several disciplines com- 
bining synthetic and analytical images. In this regard it is of particular interest 
to encourage links between researchers in areas of computer vision and computer 
graphics who have common problems and frequently use similar techniques. The 
workshop included four sessions of presented papers and two tutorials. Invited 
speakers treating various aspects of the topics were: Y. Aloimonos from the 
Computer Vision Laboratory, Center for Automation Research, University of 
Maryland, USA, G. Medioni from the Institute for Robotics and Intelligent Sys- 
tems, University of Southern California, USA, and R. Boulic, Adjoint scientifique 
from the Swiss Federal Institute of Technology Lausanne, Switzerland. 
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Abstract. Manipulation of deformable objects will be discussed. Ma- 
nipulation of deformable objects is defined as controlling deformation 
of objects as well as their positions and orientations. The manipulation 
is a fundamental and important task in many industrial fields. In fact, 
there exist many operations of deformable objects such as textile fab- 
rics, rubber parts, paper sheets, strings, and foods. In order to realize 
the manipulation of deformable objects by mechanical systems, an object 
model is indispensable. It is, however, difficult to build exact model of the 
deformable objects due to their strong nonlinearity such as friction, hys- 
teresis and parameter variations. Thus, such operations strongly depend 
on skilled human workers. To overcome this problem, we will propose a 
robust control strategy using a model based technique. We will build a 
coarse model of an object for the manipulation and will develop a con- 
trol method robust to the discrepancy between the object and its model. 
Experimental results will show the robustness of the proposed method. 



1 Introduction 

There exist many manipulative tasks that deal with deformable objects such 
as textile fabrics, rubber parts, paper sheets, and food products. Most these 
operations strongly depend on skilled human workers. We define manipulation 
of deformable objects as controlling of deformation of deformable objects as 
well as their positions and orientations in this paper. For example, a position- 
ing operation called linking is involved in the manufacturing of seamless knitted 
products [1]. In linking of fabrics, knitted loops at the end of a fabric must be 
matched to those of another fabric so that the two fabrics can be sewed seam- 
lessly. This operation is now done by skillful humans and automatic linking is 
required in manufacturing of knitted products. In this research, we describe the 
manipulations of deformable objects including linking by positioning of multi- 
ple points on the objects. Then, we regard the manipulations as the operations 
in which multiple points on a deformable object should be guided to the final 
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locations simultaneously as shown in Fig.l. In many cases these points cannot 
be, however, manipulated directly. Thus, the guidance of positioned points must 
be performed by controlling some points except the positioned points. This op- 
eration is referred to as indirect simultaneous positioning [2]. In this paper, we 
will focus on indirect simultaneous positioning as an example of manipulation 
of deformable objects. 

Some researches on manipulations of deformable objects have been con- 
ducted. For automated manufacturing of textile fabrics, many researches have 
been done [3] . Ono et al. [4] have derived a strategy for unfolding a fabric piece 
based on cooperative sensing of touch and vision. In these researches, since their 
approaches are for a specific task, thus it is difficult to apply the results to other 
different tasks with a systematic manner. In addition, some researches have tried 
to deal with on more general deformable object with systematic manners as fol- 
lows. Hirai et al. [5] have proposed a method for modeling linear objects based 
on their potential energy and analyzed their static deformation. Wakamatsu et 
al. [6] have analyzed grasping of deformable objects and introduced bounded 
force closure. Their approach is static, control of manipulative operations is out 
of consideration. Howard et al. [7] have proposed a method to model elastic 
objects by the connections of springs and dampers. A method to estimate the 
coefficients of the springs and dampers has been developed by recursive learning 
method for grasping. This study has focused on model building. Thus, con- 
trol problems for manipulative operations have not been investigated. Sun et 
al. [8] have studied on the positioning operation of deformable objects using 
two manipulators. They have focused on the control of the object position while 
deformation control is not discussed. 



Robotic fingers 




Fig. 1. Indirect positioning of deformable object 



In order to realize indirect simultaneous positioning, object model is indis- 
pensable. However, it is difficult to build exact model of the deformable objects 
in general due to nonlinear elasticity, friction, hysteresis, parameter variations, 
and other uncertainties. These are main difficulties in manipulating deformable 
objects. To solve this dilemma, we propose to utilize a coarse object model to 
derive task strategies with a vision sensor. In our approach, we first build a 
coarse model of a manipulated object. Then, the task is analyzed based on the 
proposed coarse model. One of the advantages using the coarse object model is 
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that we can analyze the task and may realize its essence relatively easily. Based 
on the results of the analysis, we can derive a control law robust to discrepancy 
between the object and its coarse model. 

In this article, we will firstly propose a coarse model of deformable objects. 
Next, indirect positioning will be analyzed based on the coarse model. As the 
result, we will derive conditions to examine whether the given positioning is 
feasible or not. Then, we will propose a control method robust to model er- 
rors based on the coarse model. Experimental results show the validity of the 
proposed method and the effects of the parameter errors on the convergence. 

2 Formulation of Indirect Positioning 

2.1 Modeling of Extensible Deformable Objects 

First of all, a model of deformable objects is proposed. On modelling of de- 
formable objects, many researches have been conducted. For example in the 
area of computer graphics, cloth deformation is animated by Terzopoulos [9] 
or Louchet, Provot and Crochemore [10] and other many researchers. Our re- 
search goal is to realize robust manipulation of deformable objects. Therefore, 
we employ more simple deformation model. We model the object by connection 
of simple springs similar with Naster and Ayache [11]. For simplicity, we deal 
with two dimensional deformable objects such as textile fabrics. We discretize 
the object by mesh points. Each mesh point is connected by vertical, horizontal, 
and diagonal springs as shown in Fig. 2. In the model, we assume that the object 
deforms in a two-dimensional plane. In order to formulate the manipulation of 
deformable objects, object model must have the ability to describe translation, 
orientation, and deformation of the object simultaneously. Thus, position vec- 
tor of the mesh points is utilized. Position vector of the (i,j)-th mesh point is 
defined as {i = 0, • • • ,M;j = 0, • • ■ ,N). Coefficients kx, ky, kg 

are spring constants of horizontal, vertical, and diagonal springs. Assume that 
no moment exert on each mesh point. Then, the resultant force exerted on mesh 
point Pi j can be described as eq.(l). 



U denotes whole potential energy of the object. Then, function U can be cal- 
culated by sum of all energies of springs [2]. Here, we assume that the shape 
of the object is dominated by eq.(l). Then, we can calculate the deformation 
of the object by solving eq.(l) under given constraints. Note that the following 
discussions are valid even if the object has an arbitrary three-dimensional shape 
by modeling the object similarly. Details have been reported in [2]. 




( 1 ) 
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(M, 0) 




(M,N) 



( 0 , 0 ) 



(0,N) 



F 



Fig. 2. Spring model of deformable object 



2.2 Problem Description 

Here, we classify mesh points into the following three categories(see Fig. 4) 
in order to formulate indirect simultaneous positioning. 

manipulated points: are defined as the points that can be manipulated directly 
by robotic fingers. {A) 

positioned points: are defined as the points that should be positioned indi- 
rectly by controlling manipulated points appropriately. (Q) 
non-target points: are defined as the all points except the above two points, 
(others in Fig. 4) 



Let the number of manipulated points and of positioned points be m and p, 
respectively. The number of non-target points is n = (M -|- 1) x {N -|- 1) — 
m — p. Then, is defined as a vector that consists of coordinate values of 
the manipulated points. Vectors Vp and r„ are also defined for positioned and 
non-target points in the similar way. Eq.(l) can be rewritten as eqs.(2),(3) using 
and 




Fig. 3. Classification of mesh point 
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dU{rm,rn,rp) 



dr 



m 



■ du{rm,rn,rp) ' 

dr 

du(rm, VnXp) 
dr„ 



- A = 0, 


(2) 


= 0 


(3) 



where a vector A denotes a set of forces exerted on the object at the manipulated 
points rm by robotic fingers. 

Note that the external forces A can appear only in eq.(2), not in eq.(3). 
This implies that no external forces are exerted on positioned points and non- 
target points. These equations represent characteristics of indirect simultaneous 
positioning of deformable objects. 

Let us consider the following task: 



[Task] Assume that the configuration of robotic fingers and the positioned 
points on an object are given in advance. Then, the positioned points rp are 
guided to their desired location by controlling manipulated points appro- 
priately. 

In order to realize the given task, an object model is indispensable since 
we have to predict directions of displacements of positioned points during the 
positioning. Then, the proposed model is useful for this purpose. However, in 
general, the model errors cannot be ignored in modeling deformable objects 
due to many uncertainties. Thus, a model inversion approach is not effective. 
Therefore, it is important to develop a control method that is robust to the 
error between the object and its model. 

3 Analysis of Indirect Positioning 

3.1 Infinitesimal Relation among Positioned Points and 
Manipulated Points 

In this section, we analyze indirect simultaneous positioning based on the pro- 
posed coarse model. Let us derive infinitesimal relation among positioned points 
and manipulated points. Now, consider a neighborhood around an equilibrium 
point ro = rjg, rj'^'g]'^. We can obtain the following equation by linear lizing 
eq.(3) around the equilibrium point. 

ASrm + BSvn -\- CSvp = 0 (4) 

where 






d^U 

drm dTp 

d^U 

. drm drn 



rO 



^ ^(2ji+2n)x2m 
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G ^(2p+2n)x2n 
rO 

^ j^(2p+2n)x2p 
rO 

Vector Srm is defined as an infinitesimal deviation of the manipulated points 
from their equilibrium points. Vectors Svn and 6rp are defined in the similar 
way. By transforming eq.(4), eq.(5) is obtained. 

= -CSvp (5) 

where F = [A B], 

3.2 Feasibility of Indirect Positioning 

We can obtain the following theorems. The proof of these theorems have been 
reported in [12]. 

Theorem 1. There exist infinitesimal displacements of manipulated 
points SVra 

corresponding to arbitrary infinitesimal displacements 5rp, if and only 

if, 

rank[A B] = 2p + 2n is satisfied. 

In addition, Theorem 1 needs the following result. 

Result 1 The number of the manipulated points must be greater than or equal to 
that of the positioned points in order to realize any arbitrary displacement Svp , 
that is, m> p. 

In the case that the number of the manipulated points is equal to that of 
the positioned points, that is, m = p is satisfied. Theorem 1 can be rewritten as 
follows: 

Theorem 2. In the case ofm = p, there exist displacements of the manipulated 
points 5rm corresponding to any displacements of positioned points Srp and these 
are determined uniquely, if and only if, det[A B] ^ 0. 



Srn 



d^u 



dr.a 


drp 


d- 


'■u 


drn 


d'f'n 


9' 


'■u 


9rp 


drp 


9^ 


■u 


drp 


d'f'n 
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4 Iterative Control Law of Indirect Simnltaneous 
Positioning 

In this section, we propose a novel control method to achieve an indirect simul- 
taneous positioning. An iterative control law is derived based on a linearlized 
model of the object eq.(5). 

In the control of indirect simultaneous positioning of deformable objects, 
a vision system is utilized to measure current positions of positioned points. 
Positions of the manipulated points can be computed from the locations of the 
robotic fingers because the fingers pinch an object firmly. On the other hand, it 
is difficult to measure positions of non-target points due to their number. 

Now, let us derive an iterative control law for indirect simultaneous position- 
ing based on linearlized equations (5). Fig. 4 shows a flow of our proposed control 
method. Assume that the number of positioned points is equal to the number of 
manipulated points, that is, m = p. In this case, A is a square matrix. This arti- 
cle deals with only the case that the matrix F is non-singular during operations, 
for simplicity. Then, eq.(5) can be rewritten as the following two equations: 

5rm = -SuF~^CSrp ( 6 ) 

= -SLF-^C6rp (7) 

where Su = [Im Omxn] Sl = [0„xm In]- Let r^, and be positions of 
manipulated points, those of non-target points, and those of positioned points 
at A:-th iteration, respectively. In eq.(6), replacing deviation Sr^n with difference 
and deviation Svp with error r^, we obtain the following equation: 

=rt- dSuF^^C,{r^p - r^) (8) 

where Fk and Ck are functions of r^, r^, and r^. Superscript and subscript k on 
variables denote their values at /c-th iteration. A scalar d denotes a scaling factor. 
The right hand side of this equation can be evaluated at the /c-th iteration. Thus, 
desired locations of manipulated points at the k-th iteration can be updated into 
those at the {k+ l)-th iteration by this equation. Note that matrix depends 
not only and Vp but also r„. Thus, non-target points is estimated by 
eq.(9). 



- dSLF^_\Ck-i{r^p-^ - (9) 

As a result, the proposed iterative control method is summarized as follows: 
First, a vision system senses current positions of positioned points. Second, lo- 
cations of manipulated points and those of non-target points are updated using 
eq.(8) and (9), respectively. Then, robot fingers are controlled with respect to 
task oriented coordinates using as their desired positions in {k + l)-th 

iteration with an appropriate controller. For example, we can utilize linear PID 
feedback. After robot fingers converged to positions of positioned points 

are measured again by the image sensor. Then, the same procedure is 
iterated. 
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k = 


jH-1 


Control position of operation points so that 






their locations coincide with 



I 




Measure positions of positioned points 



Start 



Fig. 4. Flow of proposed control method 



We can show that the positioned points rp can be converged to the desired 
ones by control law (8) even if the model includes some errors. The details have 
been reported in [12]. 

5 Experiments 

5.1 Textile Fabric 

In this section, we will show experimental results in order to illustrate the validity 
of the proposed control method and to investigate the effect of model errors on 
the convergence quantitatively. Fig. 5 illustrates the experimental setup. Three 
2DOF robots with stepping motors are utilized as robotic fingers. A CCD camera 
is utilized as a vision sensor. A deformable object is laid on a table. In the exper- 
iments, knitted fabrics of the acrylic 85[%] and wool 15[%] (100[mmj x 100[mmj) 
are utilized. The fabric is descritized into 4x4 meshes. Both of the numbers of 
the manipulated and positioned points are three. Their initial locations on the 
object are shown in Fig. 6. Markers are put on the positioned points of the fab- 
ric. Their positions are measured by the CCD camera. The configurations of the 
manipulated and positioned points are as follows: 

rm = [ xo , 3 , yo ,3 , a:qo, 2 / 1 . 0 , 2^3.2, 2/3.2]^, 

= [a:i.i, 2/1,1 , a;i.2,2/i.2, 2^2,2, 2/2,2]^- 
The desired positioned points used in the experiments are 

= [30,40, 65,50, 53.6,90]^. 

We have identified spring constants {kx,ky,kg) = (4.17,13.2,3.32) [gf/mm] 
coarsely for the control method, through tensile tests. Note that the ratio of the 
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Fig. 5. Experimental setup 



AManipulated points 9 Positioned points 




Fig. 6. Configuration of points in experiments 



spring constants is important in our control method. Then, we define a = kx/kg 
and (3 = ky/kg. From coarsely identified spring constants, a = 1.256 and (3 = 
3.976 are obtained. In experiments, various values of (3 including errors were 
utilized in the control method in order to investigate the effects of model errors. 
Moreover, values 0.1 and 0.5 of scaling factor d are used in eq.(8) to show the 
effects of the scaling factors. Fig.7-(a) and (b) illustrate the experimental results 
of d = 0.1 and 0.5, respectively. 

In these figures, we can find that the positioned points can converge to the 
desired positions if the coefficients {a, (3) is near their identified values, while they 
are gradually oscillatory or diverge if spring constant is far from the identified 
values. Fig.7-(a) shows that the positioned points converge to the desired ones 
despite of 100 times or 0.01 times of deviations of parameters a and j3. Fig.7-(b) 
shows that the positioned points diverge for 10 times and 0.1 times deviations 
of the parameters. On the other hand, the speed of convergence is higher with 
d = 0.5. As an example. Fig. 8 shows behaviors of manipulated and positioned 
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Fig. 7. Experimental results 



points with d = 0.5, {a, (3) = (1.256,3.976). The accuracy of convergence to 
the desired ones can be reached to a resolution level of the visual sensor (about 
1 [mm]). 




— Manipulated point 
-O - Positioned point 
X Desired point 



Fig. 8. Behavior of positioned and manipulated points in experiments 



According to the experimental results, we can conclude that very coarsely 
estimated parameters can be utilized in the proposed control method. Scaling 
factor d should be chosen carefully. 

5.2 Sponge Block 

In this section, we apply the proposed control method to compressing defor- 
mation of sponge blocks. In these operations, we have to consider the following 
items: 

1. In section 5.1, the robots pinch the object firmly. In the operation of sponge 
blocks, we cannot pinch the objects. Thus, we have to realize manipula- 
tions with grasping the object stably. In this paper, we do not deal with 
this matter, and apply the same control law. We will investigate the effect 
experimentally with various desired locations. 



Robust Manipulation of Deformable Objects Using Model Based Technique 



11 



2. The positions of manipulated points on the objects may change by slipping. 
With the proposed control method, we do not have to consider the effect in 
detail since our proposed method is also robust to error of the locations of 
manipulated points on the object. Thus, we can employ the same method. 

In this experiments, sponge blocks (90 [mm] x 90 [mm] x 30 [mm]) are utilized 
as shown in Fig. 9. Suppose that we consider two-dimensional deformation in 
a plane, then we ignore the deformation along thickness direction. Locations 
of positioned points and those of manipulated points are illustrated in Fig. 10. 
Coordinates of positioned points and those of manipulated points are given as 
follows: 



rp = [30,30, 60,30, 60,60]^, 



[60,0, 0,30, 90,60]'^ 



( 10 ) 




Fig. 9. Manipulation of sponge block 



As shown in Fig. 11, experiments performed for 6 patterns of desired positions. 
Desired location of each pattern is given as follows: 



Pattern 1: 
Pattern 2: 
Pattern 3: Vp 
Pattern 4: 
Pattern 5: 
Pattern 6: Tp 



[35, 30, 65, 30, 65, 60]^ (translation) 

[35,30, 62,30, 62,60]^ (compression with translation) 

[32,30, 65,30, 65,60]^ (stretching with translation) 

[27.6,32.8, 57.2,27.6, 62.4,57.2]^ (rotation) 

[32,32, 58,32, 58,58]^ (compression) 

[32.6,37.8, 62.2,32.6, 67.4, 62.2]^ (orientation and translation) 
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A Manipulated points # Positioned points 




Fig. 10. Location of positioned and manipulated points 
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Fig. 11. Motion pattern of positioned points 
Table 1. Experimental results for sponge block 
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Experimental results of sponge manipulations are shown in Table 1. In the 
first row of each pattern, Q denotes convergence with e = l[mm], A denotes 
convergence with e = 2, and x for no convergence. Second and third rows denote 
the numbers of iterations that are needed for convergence within e = 1 and 2, 
respectively. Fig. 12 (a) and (b) illustrate the experimental results of pattern 2 
and 4, respectively. 



Robust Manipulation of Deformable Objects Using Model Based Technique 
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Fig. 12. Motion of positioned and manipulated points 



According to the results, we can find that the error norm is converged within 
e = 1 in pattern 1, 4, and 6, manipulations without deformation. In pattern 
2, 5 with deformation, the errors are converged in 2. One of the reasons is on 
accuracy of the vision sensor. In the experiments, the accuracy is 0.72 [mm]. In 
addition, inappropriateness of the locations of manipulated points on the objects 
is also one of the reasons. Pattern 3 denotes the stretching operations. Thus, the 
manipulation is failed since the robots do not pinch the object. 

6 Conclusions 

In this paper, indirect simultaneous positioning of deformable objects were dis- 
cussed as an example of manipulation of deformable objects. First, we have 
proposed a coarse model of deformable objects for their positioning operations. 
Second, indirect simultaneous positioning of deformable objects have been for- 
mulated. Based on the formulations, we analyzed the indirect positioning. As 
the result, we derived the conditions that the given positioning can be achieved. 
Then, we have proposed a novel iterative control method to realize indirect si- 
multaneous positioning based on the coarse object model. The validity of the 
method has been shown through experimental results using the textile fabrics, 
and effects of the model errors on the convergence were investigated. Then, we 
conclude that very coarse identifications can be utilized for the proposed method. 
In addition, we apply the proposed control method to compressing operations 
by grasping. In compressing operations, the error can be converged in a desired 
region. 

In this article, initial locations of operation points on a deformable object 
were given before executing the manipulation. However, the task may be failed 
in or excessive forces may be exerted on the object in the case that the configu- 
ration of the positioned points is not appropriate. Then, task planning including 
configuration of operation points is important. Therefore, we need a method to 
plan configurations of robot fingers on a deformable object [13]. In compressing 
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operation, it is impossible to prove that stable grasp is maintaining during ma- 
nipulations in all cases since we do not take states of grasping into consideration. 
Therefore, we have to add information of grasping states to control law. Use of 
force sensor for this purpose is one of the important future works. 
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Abstract. We design a generic contrast and affine invariant planar 
shape recognition algorithm. By generic, we mean an algorithm which 
delivers a list of all shapes two digital images have in common, up to any 
affine transform or contrast change. Affine invariance must not be con- 
sidered as a weak requirement in shape recognition since it is well known 
that if 3D objects with smooth boundaries are being observed, then the 
deformation of the objects’ images, far enough from the occlusion bound- 
aries, is locally an affine transform. We define as ’’shape elements” all 
pieces of level lines of the image. We then discuss an efficient local en- 
coding of the shape elements. We finally show experiments. Applications 
aimed at include image registration, image indexing, optical flow. 



1 Introduction 

The shape recognition problem has many different and very specific meanings, 
because each author somehow builds his own definition. In many cases, shape 
recognition is built for a specific application, like e.g. automatic character recog- 
nition, or fingerprint recognition. Another approach, related to psychophysics, 
tries to define what we really mean when stating that two shapes look alike. Re- 
cently, various strategies to rigorously define distances between shapes have been 
proposed [27]. This distance method allows large nonparametric deformations. 
In this communication, we shall restrict ourselves to the case where perturba- 
tions boil down to contrast changes, planar affine transforms and occlusions. 
This restrictive framework is just sufficient to recognize an image which has un- 
dergone a xerocopy or a photograph (if it is a painting) and is thereafter subject 
to contrast changes and an arbitrary framing (occlusion on the boundary). The 
affine invariant framework is a well acknowledged topic [3,5,14,15]. 

The restrictions we are taking are not arbitrary, but result from a hopefully 
rigorous invariance analysis. We first argue that the local contrast invariant 
information of an image is completely contained in its level lines ([6,7]), which 
turn out to be Jordan curves. In order to overcome the occlusion phenomena, 
we wish to have an encoding as local as possible. The locality is obtained by 
segmenting each level line into its smallest meaningful parts which must finally 



H. H. Nagel and F. J. Perales (Eds.): AMDO 2000, LNCS 1899, pp. 15—26, 2000. 
© Springer- Verlag Berlin Heidelberg 2000 



16 



J. L. Lisani et al. 



be described by small codes. The curve segmentation-encoding process must 
therefore be itself invariant. 

Moreover, the description of the curves must involve some smoothing since 
level lines are influenced by the quantization process. Thus, smoothing must 
be performed in order to get rid of this influence. Another reason to smooth 
shapes, is given by the ’’scale space ideology” [26]. Indeed, many of the fine scale 
oscillations of the shapes may be parts of the shape; the analysis of the shape 
would be lost in those details. 

Following [1], the only contrast invariant, local, smoothing and affine invari- 
ant scale space leads to a single PDF, 



where Du is the gradient of the image, curv(rt) the curvature of the level line 
and t the scale parameter. This equation is equivalent to the ” affine curve short- 
ening” ([24]) 



where x denotes a point of a level line, Curv(x) its curvature and n the signed 
normal to the curve, always pointing towards the concavity. 

This equation is the only possible smoothing under the invariance require- 
ments mentionned above. This gives a helpless bottleneck to the local shape 
recognition problem, since it is easily checked ([!]) that no further invariance 
requirement is possible. Despite some interesting attempts [12], there is no way 
to define a projective invariant local smoothing. The use of curvature-based 
smoothing for shape analysis is not new[2,16,ll]. 

The contrast invariance requirement leads us to describe the shapes in terms 
of mathematical morphology [25]. In [8], connected components of level sets are 
proven to be invariant under contrast changes and [7] proposed to take as basic 
elements of an image the boundaries of the level sets (the so called level lines) , 
a complete representation of the image which they call topographic map. A 
fast algorithm for the decomposition of an image into connected components 
of level lines is described in [22] and its application to a semi-local scale-space 
representation in [23] . Each one of these connected components is a closed Jordan 
curve and in many cases, we shall identify the term ’’shape” with these Jordan 
curves. 

In Section 2, a fast algorithm to perform equation (2) is derived by going back 
to the mathematical morphology formalism ([25,18]) and defining first an affine 
distance and then affine erosions and dilations. This leads us to an axiomatic 
justification for a fast algorithm introduced by Moisan ([19,20]). This presenta- 
tion follows the general line of a book in preparation [13]. Section 3 is devoted to 
a discussion of the basic requirements of computer based algorithms for shape 
recognition. In Section 4, we explain how to segment the smoothed curves into 
affine invariant parts and how these pieces of level lines can be encoded in an 




( 1 ) 



dx 



|Curv(x)| 3n, 



( 2 ) 
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efficient way for matching. Section 5 gives a first account of what can be done 
with the generic algorithm. 



2 AfRne Invariant Mathematical Morphology and PDE’s 

We first define an ’’affine invariant distance” which will be a substitute to the 
classical euclidean one. We consider shapes X, subsets of . Let x G IR? and 
A an arbitrary straight line passing by x. We consider all connected components 
of \{X \J A), li X ^ exactly two of them contain x in their boundary. 
We denote them by CXi{x,A,X), CX 2 (x,A,X) and call them the ’’chord-arc 
sets” defined by x, A and X, and we order them so that area(CAi(x, Zi, X)) < 
area(CA 2 (a;,Z\,X)). 

Definition 1. Let X be a ’’shape” and x G M^,x ^ X. We call affine distance 
of X to X the (maybe infinite) number 6{x,X) = inf/i area(CAi(a;, zi, 

(5(a;,X) = 0 if X € X. 

Definition 2. For X CZ CQilJj aJJljTiG CL diilcxt/G oj" iflG SGi D — 

{x^5{x^X) < Wg call affine a-croded of X the set EaX = {xffi{x^X"^) > 

ai/2} = {baX-y. 

Proposition 1. Ea and Da are special affine invariant (ie they commute with 
area preserving affine maps) and monotone operators. 

Proof. It is easily seen that if X C X, then for every x, 6{x, X) > 6{x, Y). From 
this, we deduce that X C X ^ baX C baY. The monotonicity of Ea follows 
by the duality relation EaX = {baXy^’. The special affine invariance of ba and 
Ea follows from the fact that if detA = 1, then area(X) = area(AX). 

Remark : One can show that Ea and ba are affine invariant in the sense 
of Definition 14.19, in [13] that is, for every linear map A with detA > 0, 
zl-E^(detA)i/2a = EaA. 

We shall now use the following theorem in order to give a standard form to 
Ea and Da : 

Theorem 1. (TVlatheron Theorem^. Let T be a translation invariant mono- 
tone operator acting on a set of subsets of ID . Then there exists a family of sets 
13 C V{ID) which can be defined as 13 = {X, 0 G T(X)}, such that 



nx)= u n X -y = {x,3B & B,x + B C X} (3) 

BGiB 1/GB 



Conversely, 3 defines a monotone, translation invariant operator on V{ID) 
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Definition 3. We say that B is an affine structuring element ifO is in interior 
of B, and if there is some 6 > 1 such that for every line A passing by 0, both 
connected components of B \ A containing 0 in their boundary have an area 
larger or equal to b. We denote the set of affine structuring elements by IB^s. 

Proposition 2. For every set X, 

U n X-y = {x,3BelB,e,x + a^/^BcX} 

BeiBatf y^a^/^B 



Proof. We simply apply Matheron theorem. The set of structuring elements 
associated with Ea 'is IB = {X, EaX 9 0}. Now, 

EaX 9 0 (5(0, X^) > a^/2 ^ inf^area(CAi(0, Z\, X))^/^ > 

This means that for every Z\, both connected components of X \ Z\ containing 0 
have area larger than some number b > a. Thus, X belongs to af^^lBsS by 
definition of Sdas- 

By Proposition 2, x belongs to EaX if and only if for every straight line A, 
chord-arc sets containing x have an area strictly larger than a. Conversely we 
can state : 

Corollary 1. EaX is obtained from X by removing, for every straight line A, 
all chord-arc sets contained in X which have an area smaller or equal than a. 



2.1 Application to Curve Affine Erosion/Dilation Schemes 

Let Co be a Jordan curve, boundary of a simply connected set X. Iterating affine 
erosions and dilations on X gives a numerical scheme that computes the affine 
shortening ct of cq at a given scale T. In general, the affine erosion of X is not 
simple to compute, because it can be strongly non local. However, if X is convex, 
then it has been shown in [20] that it can be exactly computed in linear time. In 
practice, c will be a polygon and the exact affine erosion of X -whose boundary 
is made of straight segments and pieces of hyperbolae- is not really needed; 
numerically, a good approximation of it by a new polygon is enough. Now the 
point is that we can approximate the combination of an affine erosion plus an 
affine dilation of X by computing the affine erosion of each convex component 
of c, provided that the erosion/dilation area is small enough. The algorithm 
consists in the iteration of a four-steps process: 

1. Break the curve into convex components. 

2. Sample each component. 

3. Apply discrete affine erosion to each component. 

4. Concatenate the pieces of curves obtained at step 3. 

• Discrete affine erosion. This is the main step of the algorithm: compute 
quickly an approximation of the affine erosion of scale a of the whole curve. The 



Shape Recognition Algorithm 19 



pj pj pj pj 
-^0 > -^0 ^ i+l 



j of each convex component 
/2. Then, the effective 



first step consists in the calculus of the “area” A 
Ci = given by Aj = 

area used to compute the affine erosion is tTg = max {ct/8, min^ Aj} . We restrain 
the erosion area to ae because the simplified algorithm for affine erosion may 
give a bad estimate of the continuous affine erosion+dilation when the area 
of one component is less than the erosion parameter. The term <t/8 is rather 
arbitrary and guarantees an upper bound to the number of iterations required 
to achieve the final scale. The discrete erosion of each component is defined as 
the succession of each middle point of each segment [AB] such that 



1. A and B lie on the polygonal curve 

2. A or B is a vertex of the polygonal curve 

3. the area enclosed by [AB] and the polygonal curve is equal to (Je 



• Iteration of the process. To iterate the process, we use the fact that if E^r 
denotes the affine erosion plus dilation operator of area cr, and h = (hi) is a 
subdivision of the interval [0,H] with H = T/lo and w = ^ then 







Ct 



as \h\ = maxi /li+i — hi —f 0, where ct is the affine shortening of cq described 
above by (2). 

The algorithm has linear complexity in time and memory, and its stability 
is ensured by the fact that each new curve is obtained as the set of the middle 
points of some chords of the initial curve, defined themselves by an integration 
process (an area computation). Hence, no derivation or curvature computation 
appears in the algorithm. 



3 Requirements for the Local Encoding of Curves 

Humans have little problems in recognizing familiar objects in a 2D image even 
when lighting conditions vary or objects have been partially occluded, indepen- 
dently of perspective effects. 

Therefore, any reliable computer based vision algorithm for shape recogni- 
tion, should satisfy these three previous requirements, namely : 

1. Invariance under changes in lighting conditions. Specifically, we should ask 
our algorithms to be invariant under contrast changes. 

2. Invariance under perspective transformations. This is in general a too strong 
requirement. Indeed, as shown in [I], a previous local smoothing of an image, 
made necessary in order to reduce quantization effects and image complex- 
ity, can commute only with planar affine transforms. Thus, we shall only ask 
in practice this weaker invariance. Notice, however, that if 3D objects with 
smooth boundaries are being observed, then the deformation of the objects’ 
images, far enough from the occlusion boundaries, is locally an affine trans- 
form. Indeed, as remarked in [9] every map is locally an affine transform. 
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A weaker requirement, which we will also investigate, is similarity invariance 
(invariance under translations, rotations and uniform scaling). 

3. Recognition of partially occluded objects. Therefore, shapes must be de- 
scribed as locally as possible in order to prevent side effects of the loss of 
some part of a shape in the description of its other parts. 

Three more requirements may be introduced in order to improve the relia- 
bility and feasibility of the algorithms : 

4 A scale-space representation [26] : such a representation, provided it is invari- 
ant enough, provides us with more and more sketchy versions of the shapes to 
be recognized. This is required, as mentionned in the introduction, in order 
to reduce the complexity of the shape descriptors and to eliminate various 
noises in the fine levels shape description, particularly the quantization ef- 
fects. Therefore, recognition algorithms should be integrated in a scale-space 
framework, which, as we have seen, can be, at most, affine invariant. 

5 Uniqueness [16], different shapes must have different descriptors. This re- 
quirement is never completely achieved, and must be replaced by a some- 
what vague requirement, according to which shapes which do not look alike 
should not be matched. Here, visual inspection remains a valid criterion to 
decide of the validity of an algorithm. We shall therefore try to show many 
examples. 

6 Incremental description : The fact of adding new features to the description 
of the shape must not disturb the previous description, but simply add a 
new information to it. 

7 Completeness of the image description : We want ALL shapes present in the 
image to undergo the shape comparison algorithm. By ’’all”, we mean that 
the list of the shapes of the image must enable us to fully reconstruct the 
image. If (e.g.) we match Image 1 with itself, the list of matching shapes will 
be of course equal to the list of all shapes present in Image I. Now, this list 
being complete, we will be able to reconstruct fully the image 1 from this 
list. In the general case, we want to be able to show an image containing 
all shapes of Image I which match Image 2 and only them. This image will 
therefore be the ’’intersection of Images I and 2” (see [4] for the definition 
of image intersection algorithms in Mathematical Morphology.) 

The first requirement leads us to describe the shapes in terms of mathematical 
morphology, as explained in section 1. The second requirement (affine invariance) 
constrains us in two ways : clearly, descriptors of the shape must be (at least) 
affine invariant and we have mentionned the impossibility of requiring more 
invariance in the context of local smoothing. In the next section, we shall discuss 
which kind of features we can use. The second constraint is less obvious and 
related to the fourh requirement : the scale-space representation of the shape 
shall also be affine invariant. Typical scale-space representations, based on a 
gaussian-type filtering, do not satisfy this requirement. An affine invariant scale 
space based on partial differential equations (PDE) have been proposed in [1] and 
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is equivalent to the ’’affine curve shortening” proposed simultaneously by [24]. 
Recently, a fast scheme for the computation of the affine shortening, applied 
directely to the level lines of an image, has been proposed in [19] and [20] (see 
Section 2 for a detailed description of the algorithm) . 

The third and fifth requirements are somehow contradictory. We must be 
able to find a description of the shapes local enough to allow local matching but 
global enough to prevent that several parts of different shapes have the same 
description. This trade-off can only be dealt with in an experimental context 
first, and then with statistical arguments which we shall not develop here. 

The sixth requirement is very important in terms of robustness and relia- 
bility of the shape descriptions. It means that any new feature introduced in 
the description of a shape will either be redundant or improve the description, 
but never disturb the previous information that we had. In fact, the seven given 
design criteria define a generic, algorithm to shape recognition. 

Following these seven design criteria, we have developped two algorithms for 
shape recognition, one of them more restrictive, only dealing with the problem of 
recognizing shapes under similarity transform, and a more general one, providing 
registration under affine transforms. 



4 Algorithms for the Description of the Shapes in an 
Image 

4.1 Similarity Invariant Description of Curves 

In the search for an invariant description of a curve, the starting point for the 
sampling must be invariant, and so must be the sampling mesh. Typically, in- 
flexion points have been chosen because they are affine invariant. Now, since the 
curve is almost straight at inflexion points, their position is not robust, but the 
direction of the tangent to the curve passing through them is. Another affine 
invariant robust semilocal descriptor is given by the lines which are bitangent to 
the curve (see Fig. 1). 




Fig. 1. Inflexion points (marked with small triangles) and bitangents of a closed 
curve. The area defined by each bitangent and the original curve is marked (Al) 



Our reference system is formed by such a line, and the next and previous 
tangents to the curve which are orthogonal to it (see Fig. 2). The intersections 
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Fig. 2. Left: Local reference system for similarity invariant normalization: refer- 
ence direction (RD), normal directions (Nl, N2) and reference points (Rl, R2). 
The portion of the curve normalized with this reference system starts at P 1 and 
ends at P2, passing through the inflexion point. Right: Similarity invariant nor- 
malization. The y-ordinate of the marked points is used to encode the piece of 
curve 



of each one of these lines with the reference line provide two reliable points 
independent of the discretization of the curve. The portion of the curve to be 
normalized is limited by these points. Normalization consists in a similarity 
transform that maps the reference line to the s-axis and that sets the distance 
between the two reference points to 1. We discretize each one of the normalized 
portions of the curve with a fixed number n of points, and we store, for each 
discretized point, its y coordinate (see Fig. 2). This set of n values is used to 
compare portions of curves. 

4.2 AfRne Invariant Description of Curves 

If we look at Fig. 1, we can observe that the portion of the curve between the 
points defining the bitangent, together with the bitangent itself, define an area 
{Al), from which further invariant features can be computed. In particular, we 
can compute the barycenter of this area, an affine invariant reference point. 
We compute then the line B1 parallel to the bitangent and passing through 
the barycenter. B1 divides the initial area into two parts and we compute the 
barycenter of the part which does not contain the bitangent (see Fig. 3). This 
second barycenter is a second reference point. Finally a point in line B1 such that 
the area of the triangle formed by this point and the two preceding barycenters 
is a fixed fraction of the initial area Al is a third reference point (see Fig. 3). 
We therefore obtain three nonaligned points, that is an affine reference system. 
This strategy is related to [10]. The discretization points are taken at uniform 
intervals of length on the normalized curve. The total length of the normalized 
curve is also used in the code. This set of 2n-|-l values is what we use to compare 
portions of curves. 

5 Experimental Results 

Figure 4 displays a picture of a man and the same picture after the man has 
moved and their level lines after smoothing with the iterative scheme described 
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Fig. 3. Left: Local reference system for affine invariant normalization: reference 
points (Rl, R2, R3). The portion of the curve to be encoded has endpoints 
PI and P2. Right: Affine invariant normalization. The length of the normalized 
piece of curve together with the x and y coordinates of the marked points are 
used to locally encode the curve 



in section 2. Clearly some level lines have suffered a significant deformation and 
some others are affected by the disocclusion of the left arm of the man. Even 
if some parts of the level line remain unchanged, registration methods based on 
global matching would fail in detecting those lines. In Figure 5, we show the 
result of the matching of several pieces of a level line in the first image with 
other pieces of level lines in the second image. 

Figure 6 displays two frames of the ’walking man’ sequence. Observe that 
occlusion and deformation are both present in the images (occlusion of the arm 
and deformation in the position of the legs). Figure 7 shows the result of the 
matching of a level line describing the man in the first image (in this case it is 
easy to find an unique level line, since there is a big contrast between the man and 
the background), and the level lines in the second image. Observe that, despite 
the occlusion and the deformation, the registration method (affine invariant) is 
able to retrieve the correct level line in the second image. 
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Abstract: Seeing the talker's lips in addition to audition can improve 
speech understanding which is rather based on lips shape temporal 
evolution than on absolute mouth shape. In this article, we propose an 
adaptation of Active Shape Model (ASM) to the extraction of lips shape 
over an image sequence. The algorithm does not require any make-up 
or markers and works under natural lighting conditions. 

After the definition of a training base, initial mouth model is iteratively 
deformed under constraints according to spatiotemporal energies 
depending either on luminance or hue. A robust prior detection of four 
points of the model is proposed in order to automatically and accurately 
initialize the egde detection. 

The success of our approach is tested on many image sequences of 
multi-speakers with multi-speaking. 

Keywords: Active Shape Model, automatic lips edge detection, AGP, 
natural lighting conditions. 



1 Introduction 

Our research is a part of an advanced audio-visual communication tool which 
integrates both audio and visual features. It aims at improving speech understanding 
in the presence of acoustic noise by extracting lips shape. It is well known that seeing 
the talker's lips in addition to audition can improve speech intelligibility (cf labial 
reading for deafs). 

Extracting lips boundaries is a difficult task because lips shape is highly variable. 
Variability comes from individual appearence (spatial variability), locutions (temporal 
variability) and lighting (spatiotemporal variability). In order to detect deformable 
templates, many approaches have been developped [5]. In this paper, we focuse on 
the Active Shape Model (ASM) proposed in [4]. The efficiency of ASM has been 
widely proved in the litterature for detecting pedestrians in [1], for localizing moving 
hands in [6] and for face modelisation in [7]. We explain how the method has been 
adapted to the difficult problem of lips edge detection. 
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In a training step, a model point for open mouth and a model point for closed 
mouth are defined. The deformations of such models are learned on a training base 
with multi-speakers of multi-speakings and are then limited to the principal ones after 
a Principal Components Analysis (see section 2). 

With ASM, there is a crucial problem with initialisation. Either initialisation is 
hand-made so that the processing is no more autonomous. Either initialisation 
involves algorithms which have a too high computational cost for real-time 
implementations [4]. In this paper, the initial shape is the average shape which comes 
from the training step and which has been automatically put and resized in the frame 
to be processed after the detection of four points of the model in a preprocessing step 
(see section 3). 

As written in [2] : " the real information in lipreading lies in the temporal changes 
of lip positions rather than in the absolute lip shape". Temporal information must be 
taken into account to improve lips boundaries detection. In section 4, spatiotemporal 
energy functions are proposed to deform the initial shape. 



2 Training Step 

In this section, two mouth models are defined and we learn how each mouth shape 
is allowed to be modified. This yields to an allowable deformation domain [4]. 



2.1 Models of Open and Closed Mouth 

A mouth is described with a set of discrete points called landmarks. We put the 
main landmarks as the comers of the mouth (or comissures), the arch of Cupidon... 
Main landmarks are in black on figure 1. Others points are equi-distributed between 
the main points. This yields to a 23 points model for a closed mouth and to a 30 points 
model for an open mouth (cf figure 1). It should be noted that these models do not 
contain any curve description (no parabol, no quartic), they belong to the class of free 
form deformable templates. 





Fig. 1. Landmarks distribution : open mouth and closed mouth models 
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2,2 Training Step 



•»$l-=Fin=p- 

Fig. 2. Training bases : left, 16 handly-labelled closed mouths ; right, 20 handly- 
labelled closed mouths 

For each model, a training base is handly built. Such a base contains 16 images for 
closed mouths and 20 images for open mouths (cf Figure 2). These frames come 
from different speakers with different locutions. With these bases, we learn the 
average shape for an open mouth (resp. for a closed mouth) and how an open mouth 
(resp. a closed mouth) is deformed according to the speaker and what he is saying. 
Figure 3 shows on the top the 16 closed mouths superimposed on each other. Each 
mouth is aligned on the first mouth of the base, the average mouth is computed and 
the position of each landmark in comparaison with the same average landmark is 
studied. 




2,3 Principle Component Analysis (PCA) 

PCA has been shown to be very useful because of its capability to reduce the 
dimensionality and to extract the important features in terms of amount of variance 
retained. 

A mouth shape is described by the coordinates of all the landmarks. Let vector x 
represents the set of these coordinates. For each shape in the training base, we 
calculate its deviation from the mean shape X . Such analysis yields a big matrix 
containing all the possible variation modes for each landmark. Keeping only the first 
four principal variation modes for each model represents more than 80 purcents of the 
total variance. 

Any valid shape x, in the sense of the training data, can be approximated by adding 
the weighted sum of that four variation modes to the mean shape [4]: 

X = X + Pb 

where P is a matrix containing the first four eigenvectors and b = (bi, b 2 , bs, b 4 ) is a 
vector of four weights characteristic of shape x. 
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Fig. 3. Mouths aligning and average shape 

3 Preprocessing Step 

3,1 Initial Shape: Average Shape 




Fig. 4. Average shapes : left, average open mouth ; right, average closed mouth 

Initial shape is the average mouth shape coming from the training step (cf 
Figure 4). 

3,2 Placing the Initial Shape 

Images acquisition 




Fig. 5 . micro-camera 
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In our project, the camera is interdependant with a helmet in order to center the 
mouth within the image (cf. figure 5). Frame capture is achieved at 25 images per 
second by a mono CCD colour camera. 

According to that frame acquisition system, the lips search area is limited to 1/4 
and 2/3 from the top of each frame as mentionned on the left of figure 6. 

Commissures and Cupidon's arch detections (points 1,9 and 5 of the model) 




Fig. 6. From left to right : lips search area ; white line of minima superimposed on 
mouth frame ; curve of Y-axis of the minima of luminance 

In order to automatically put the initial mouth shape, we are loocking for three 
particular points i.e. both comers and the Cupidon's arch (points 1, 5 and 9 of the 
model). This is an improved version of the robust initialisation proposed in [3]. 

By noting that at the frontier between both lips, there is a very dark line of 
luminance, commissures are detected as follows: on each column of the lips search 
area, the pixel with the lowest luminance is extracted. The white line at the middle of 
figure 6 represents on each column the pixel with the lowest luminance. We extract 
the position of both comers of the mouth by looking for the important jumps on that 
white line, that is by detecting both peaks on the curve of Y -axis of such minima (cf 
right curve of figure 6). 

In order to detect point 5, the vertical gradient of the luminance function is 
computed. All the negative values of that gradient are rejected because as the upper 
lip contours are concerned, vertical gradient values are positive: the skin is clearer 
than the lip. Point 5 is detected as the pixel of the frame which has the maximum 
vertical gradient and which is located on the upper vertical line going through the 
middle point between both commissures. 




Fig. 7. Automatic detection of points 1, 5 and 9 

Figure 7 shows the results of automatic detection of points 1, 9 and 5 on different 
frames. Results are quite good even in case of bearded men. 
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Detection of a 4**' point 



In order to completly define the translation and the resizing to be applied to the 
average initial mouth, we propose to detect another point which is the point with the 
maximum of hue gradient located on the vertical line going through the middle point 
between both comissures and below the comissures. Hue is computed as 

. . 256*ImG(x,j,f) 

hue[x,y,t } = 7 r as defined in [8] where ImG is the green 

lmR[x,y,t) 

component of colour frame and ImR is the red one. Figure 8 shows the result of 
detection of this point on different frames. The lower cross indicates the position of 
detected point (others points are the detected Cupidon's Arch and commisures). 




Fig. 8. automatic detection of the 4* point (cross on each frame) 



This point may then correspond either to point 27 or to point 13 of the model in 
case of an open mouth (respectively either to point 13 or to point 20 of the model for 
a closed mouth). In order to correctly identify the detected point, the distance d 
between the detected Cupidon's arch and the detected point is computed and this 



distance is compared with 
computed on the average 



= dist[pt^ ,pt 2 i) and = distiypt^ ,pt^^ 
open mouth model (respectively with 



d = distiypt ^ , pt and d 2 Q = dist(^pt^ ,pt 2 Q^ computed on the average 
closed mouth model). Following rules are then applied : 



• for an open mouth 

if ^ ^ detected po 'mi = 

else detected po int = pt^^ 

• for a closed mouth 

t/ [j-tijol ^ dQiected po 'mi = pt 2 Q 

else detected po 'mi = pt^^, 



As a result of such processing, on frames of figure 8, from left to right, the 
detected point is identified as pt 27 ,pt /3 and respectively. 
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Putting and resizing the average shape 

Initial shape is the average shape which must be adjusted to the frame to be 
processed. Points 1,9,5 and 20 (or 27 or 13) of the average shape are aligned with the 
position automatically detected and identified in the frame to be processed. 
Translations and rotations are necessary to make coincide points 1 and 9 of the 
average model with those detected on the frame. Points 5 and 20 (or 27 or 13) controll 
the increasing or decreasing of the average shape size. Figure 9 shows the initial 
shape before (average shape) and after (final initial shape) automatic positioning. On 
the right frame, the average shape has been enlarged because of the detection of both 
commissures. On this exemple, it should be noticed that the right position of 
commissures is not easy to detect even visually. 




Fig. 9. On each frame, the black mouth corresponds to the average mouth and the 
white mouth is the initial shape for lips detection after automatic translation and 
resizing 



4 Lips Detection Algorithm 
4,1 Energy Functions 

Starting with the initial shape, each point is moved perpendicularly of the contours 
in order to fit with a pixel of "maximum" information. 

Spatiotemporal gradient of luminance for upper lip 

Each point of the upper lip are moved according to the spatiotemporal gradient of 
the luminance function: the lip is darker than the face skin (spatial gradient) and with 
the hypothesis that the lighting of the scene is uniform, any motion induces a temporal 
variation of the luminance function (temporal gradient). Gradients U(x,y,t) and 
D(x,y,t) associated to up and down motions respectively are defined as following [9]: 

u(x,y,t) = ( 

D[x,y,t) = (^ + ^) * l(x,y,t) 



where G is a Gaussian filter 
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and I is the luminance function: 



l(x,y,t) = \^i^mR(x,y,t) + lmG(x,y,t) + lmB(x,y,t)). 



In order to be insensitive to the sign of U and D, they are converted into energy 
functions [9] : 



EXx,y,t) = G*U^(x,y,t) 
E^[x,y,t) = G^ D^[x,y,t) 



Luminance 








Luminance energy Ed 





Hue energy Ed 




Fig. 10. Comparisons between « luminance » energies and « hue » energies 

Top line of figure 10 shows and Ej for a frame. High values of energy are 
represented in white and low values are represented in black. High values are located 
near lips edges especially for the upper lip. 

Spatiotemporal gradient of hue for lower lip 

Same spatiotemporal energy functions are computed for the lower lip but in this 
case, spatiotemporal gradient of hue is considered. Because of shadow between the 
lower lip and the chin, the spatiotemporal gradient of luminance is not enough 
discriminant ; there is very little contrast between the frontier of the lower lip and the 
skin as you can see on the exemple on the left frame of first line of figure 10. On this 
frame, it is very easy to see the edge of the upper lip but it is much more difficult to 
see the edge of the lower lip. In such case, hue is much more discriminant (see first 
frame of second line of figure 10). Figure 10 shows the " hue " energy and the " 
luminance " energy for a same frame. For lower lip, " hue " energy is more discrimant 
than " luminance " energy. Indeed, the maxima of « hue » energy function are located 
at the edge of that lip (see the white edge at the bottum of lower lip on « hue » energy 
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frames of figure 10). In case of « luminance » energy, no such clear minima appear 
for the lower lip so that the detection algorithm should fail in finding right positions 
of the lower lip edge points. 



4,2 Iterative Constrained Deformation 




Fig. 11. Moving direction for each landmark (4 possible directions for the 
commissures) 

At each iteration, each landmark is moved to fit with the pixel of maximum 
energy among the n pixels above or under the current one (cf figure 1 1). The value of 
n is chosen to be dependant of the processed point. Indeed, the value of n is 
decreasing from the center of the model to the right and left comers. Middle points 
are allowed to moved faster than the points near the comers. Such evolution of n 
induces good results and decreases the number of iterations before convergence. 

The comers of the mouth are allowed to move not only in one direction but in four 
directions (cf figure 11) because it is not easy to determine the direction which is 
rightly perpendicular to the countour at these points. 

Once all the landmarks have been moved according to the maximum energy 
information, we obtain what we call the maximum energy shape. We search in the 
allowable deformation domain the shape which fits at best with the maximum energy 
shape. This process is repeated until the contour does no more change. Figure 12 
shows the result of lips detection for open mouths. Each mouth is very different from 
one person to another. But those differences have been taken into account in the 
training step because the training base is made of frames from different locutors. 
Results are thus good in all cases. Figure 13 shows results for lips detection of closed 
mouths. 

5 Conclusion 

We presented an adaptation of ASM for lips detection in a spatiotemporal 
framework. This yields to an automatic detection algorithm. A real time 
implementation of that algorithm is under study. Having a helmet on the head for that 
application should be a drawback. We are also developping an algorithm which 
localise the mouth in a frame whatever the camera centring is. The presented 
algorithm will be suited to others acquisition systems. 
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Fig. 12. Lips detection results in case of open mouths 




Fig. 13. Lips detection results in case of closed mouths 
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Abstract. Deformation tools constitute an important topic in computer 
animation. This paper shows how we can give an exact value to the ba- 
sic parameters in a dynamical elastic deformation system based on finite 
elements for 2D objects. We look for an optimal value for the time step 
{At) in the dynamical system, and an optimal area for the basic square 
finite element of the object. Fine time step adjustment is important to 
reduce the computational cost of the system, and guarantee the real- 
istic look of the result (final deformation of the object). Then several 
results from different physical conditions are compared, in order to find 
a good system of measuring the difference between them. Finally, using 
this measurement parameter we can relate the size of the finite elements 
with the error between several deformations of the same object. The de- 
formations are rendered using the Open Inventor application (VRML). 

Keywords: elasticity, render, VRML, Open Inventor, elastic deforma- 
tion, finite element, mesh generator. 



1 Introduction 

In this article we analyze an elastic deformation model based on finite ele- 
ments [1,2] (see [8] and [10] for more examples of similar models). We present a 
deformation model in the “physical model” category. A more detailed study of 
the deformation model classification can be found in [3], and in [3, 4, 5, 6, 7,8], we 
can see several examples of physical models. The aim of this work is to obtain 
the solution of the movement equation taking into account the elasticity param- 
eters [2] in order to obtain the best results with the lowest computational cost. 
In this way we can optimize on the one hand the spatial discretization of the 
object, and on the other the time step of the differential equation model. Thus 
we reduce the number of iterations in the dynamic system and the size of the 
matrix related to our deformation model. 

Section 2 presents a brief description of the deformation model, establishing its 
bases and obtaining the ordinary differential equation that will manage our nu- 
meric algorithm. Section 3 explains how to obtain the optimal mesh size to make 

* Suported by the project TIC98-0302 of CICYT. 
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the finite element structure. In Section 4 we refine the time step parameter in or- 
der to reduce the computational cost of the dynamic algorithm. An optimal value 
for the time step provides the numerical solution with the quantity of iterations 
necessary to reach the final position of the object. Thus the system does not 
make superfluous iterations, and guarantes the right strain wave transmission 
and system stability of the numerical algorithm. Section 5 defines the level of 
detail of the deformation process. In this section we define an error measurement 
that lets us compare several deformations with identical physical conditions on 
the same object with different finite elements sizes. We also present examples of 
this measurement process. The last section makes a short survey of the principal 
concepts laid out in this paper, and draws some conclusions about this work. 

The model discussed in this article is a physical model that tries to simulate 
deformation very realistically, by using algorithms based only on physical pa- 
rameters. Elastic deformation tools are useful in many commercial sectors and 
industries and can be found in generating deformable object animations, gen- 
erating computer animation, fixing strength conditions of certain materials, in 
architecture design, clothing factories, medical research, etc. 

The deformation model analyzed in this work is based on the elasticity the- 
ory (strain tensor and deformation tensor), and on the variational formulation 
presented in [2]. 



2 A Dynamic Model to Simulate Elastic Deformations 
Based on Square Finite Elements 



For the computational process, the object to be deformed is represented as a 
quadrangular mesh and the physical conditions corresponding to the deformation 
(the object’s physical properties, and external forces) are taken into account. The 
physical deformation can be calculated without user interaction, and represented 
dynamically with computer animation tools - VRML and Open Inventor. 



Let fl be the 2D object that we are going to deform. Using the deformation 
tensor (1) and the strain tensor (2) definitions, and relating them using Hookes 
law 



dui duj 
dxj dxi 



( 1 ) 



O'ij 




Sij , ij j — 1 5 2 , 



( 2 ) 



we can obtain the deformation process equation expressed in partial derivatives: 
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- E"=i = /i, * = 1, . . . , n en Qt , 
Mi = 0, i = 1, . . . , n en /q X (0, r) , 

= 0, i = l,...,nenri X (0, T) , 

Ml (-,0) = Mo,i, i = 1, . . . , M en , 

^(•,0) =Mi,i, 1= l,...,nenl2. 



( 3 ) 



where Qt = ^ {0,T) , F = dH = /oljA, and A is the part of the dH to 
wich we apply external forces, p represents the density of the object, /i are the 
external forces applied to the 17 and u is the vector deformation of the object. 



Using the technique described in [2] we can obtain the following equation 

pB^” + jBC + A^ = L^ pBC' + jBC + B~^A^ = B-^L . (4) 



In above equation: 



— 7 is the damping factor of the deformation, generated by the finite element 
analysis. 

— Matrix B represents the strain relations between the verteices of the object. 

— Matrix A represents the physical conditions of the object (internal forces) . 

— Vector L represents external forces which cause the deformation of the ob- 
ject. 

— Vector ^ represents the position of each vertex on the meshed object at some 
step of the deformation simulation. 



Performing a discretization process by time on (3) we can generate the next 
expression: 



pB 



^{t + At) - 2£, (t) + ^(t- At) 



AF 

F'-fB 



^(t + At)- ^(t- At) 
2At 



+ A^t)=L (5) 



The 7 factor is very important for simulating realistic deformation results. Null 
values generate deformation simulations without energy loss, which consequently 
present results with expanding and contracting movements for a long time. Thus 
using the equation defined in (5) we can reach the final equation of the dynamic 
process: 

+ ( 6 ) 

A simulation of the deformation is simply a question of solving the linear equa- 
tion (6) where the unknown values are the positions of the vertices of the object 
at the instant t + At, represented in (6) by the ^t+At vector. 
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3 Fixing an Optimal Size for the Mesh Finite Elements. 
Level of Detail in Deformation 



In this section we fix the optimal size of the object’s finite elements, in order to 
find the minimal vertex quantity in the object that produces a stable deformation 
process, (the right strain waves caused by external forces applied to the object). 

First, we represent external forces / using the spatial and time components: 
f{x,y,t) = fe{x,y) ■ where fe is the spatial component and ft is the time 

component. Once that is done, we calculate the Fourier transformation of the 
component expressions F{w). We then define a spatial interval [wm,WM] 
where F{w) function is greater than a certain upper boundary. In association 
with the lower and upper frequencies Wm and wm we can define the minor and 
major periods as Tm = and Tm = Then, transverse and longitudinal 
wave transmission velocities through a material with Lame A and /r coefficients 

are Vt = ; where p is the material’s density. Thus, the transverse 

and longitudinal wave of the strains caused by the forces / in the maximal and 
minimal period are: 



Am.t — Xf * T^, — Xt ' , \m.l — Xl • Xm,1 — Xl ' Tj\^ . 

Now we must take mesh h for the finite elements to verify 



h < 



minlAj7T,^i, 



T„, 



10 



10 



27T 

Wm 






100 • p 5 wm 



By way of example of all the concepts laid out in the text above, we can apply 
a 1 N force on the examined object for 1 second. From a mathematical point of 
view, the time component of the force can be expressed as 



Mt) = 



1, si t G (0, 1), 
0, otherwise. 



The value of the Fourier transformation is F{w) = The picture 

of |F’(w)| can be examined in Figure 3. We can take Wm = 0 and wm = 20 using 
the symmetric aspect of |F’(rc)|, (|F(20)| « 0.0854). Thus the mesh size h must 
obey the following expression: 



h < 



5wm 



7T 

Too 



4 Time Step Adjustment for Optimal Processing 

To obtain the maximum speed of calculation we look for a At value which 
enables us to make the necessary number of iterations whilst maintaining system 
stability. 
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Fig. 1. ft{t) Fourier transform plot 



The value chosen for At must be less than the smallest transmission wave period 
of the system, to guarantee correct propagation of the strains inside the deformed 
object. This kind of analysis normally generates very small time step values. In 
order to estimate our At value we suppose that the smallest wave period is T„, 
where T„ = ^ and is the absolute value of the largest eigenvalue associated 
with the generalized and nonlinear eigenvalue problem of the final differential 
system 

A(j) = uj'^pBcj) , 

where A is the strain matrix, B is the mass matrix of the deformation model, 
and is the biggest eigenvalue of the system. Thus, for fine odjustment, the 
recommended value of At would be approximately 

Now we have all the necessary items to calculate Tn, and the problem associated 
with calculating At in our system is reduced to a generalized, nonlinear eigen- 
value problem. We can see a detailed method for efficient computation of the 
generalized and nonlinear eigenvalue problem in appendix A. 



5 Level of Detail in Computing Deformation Simulations 

The importance of the definition of the level of detail concept in elastic defor- 
mation models is rather similar to that of the level of detail concept used in 
3D scene visualization. That is, depending on the precision with which we wish 
to simulate the deformation of the object, we must apply one size or another 
for the finite elements of the object in the mesh generation process. A greater 
value for the level of detail parameter produces a smaller matrix for the dynamic 
deformation process. 

To determine the meshing level needed to compute the results with a predefined 
precision, we need a measurement tool that lets us know the error generated in 
the simplification of the problem. In our model we define the deformation error as 
the difference between the deformations calculated starting from certain physical 
parameters. Which are selected according to the future use of the deformation 
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results. Thus, it is possible to render the same deformation with different levels 
of detail, depending on the future use of the simulation results. 

Hence, a deformation simulation - which is good enough for some specific pur- 
poses - can be obtained at a low computational cost. 



5.1 Computing Differences between Several Deformations 

In order to detect the differences between deformation results, we measure the 
deformation of the boundary of the object under the same physical conditions, 
but with disctint levels of detail; applying the following function: 



E{ro,r,) = 

r ^J{So is) - S, (s))^ + {Ko is) - K, {s)f ■ {s)f + {y[ {s)f ds , (7) 

J Sq 



where Iq and Fi are the curves associated with the boundary of two deformed 
objects represented with different levels of detail but the same physical features. 
5o (s) and S\ (s) are the associated arc length for the parameterization of the 
curves with s G (sq,si): 



S,{s)= f x[{sY + y[{sYds , 

J So 

where Tj(s) = (xi(s), yi{s)), i = 0, 1 . To fix the concepts, we suppose that the 
finest mesh is associated with curve A- 

The reason for the <S'i(s) factor’s presence in the defined function (7) is because 
if S{x) is defined as the arc length of the parameterization on any curve, we 
see that normally the expression ip [S (a;)) • S' (x) dx does not depend on the 
parameterization involved in the process: 



f fxo ^ (2^)) ■ (^) = /to ^ ^ = /to ’ 

I t = S' (a;) , 

I dt = S' (x) dx , 



Therefore, with the definitions of the paragraph above, we can say that results 
generated by the use of the function E in the curves to compare is not jiffected 
by the parameterization used. For this reason we define the function E as the 
measurement of the existing differences between two examined deformations. 
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5.2 Fixing the Error between Our Optimal Deformation and others 
Deformations 

The algorithm to compute the differences between two deformations is: 

1. We apply the same deformation conditions to two objects with different level 
sof detail in their finite element mesh, in order to compare the results of each 
deformation. One of the finite element mesh size corresponds to a level of 
detail to produce good deformation results. The second mesh size will be a 
simplification of this precise mesh (finite elements greater than the precise 
one, Figure 2). 

2. In order to apply function E on each spline as shown above . We define this 
computed value as the error present in the simplified deformation. 

5.3 Example of the Error Computation Generated by a Simplified 
Deformation 

We take the same object meshed at two different finite elements sizes (fine level 
of detail and relaxed level of detail), and we deform them with those physical 
parameters: (Figure 2): 

— Vertices in P are the fixed vertices of the object during the deformation 
process. 

~ On the vertices in set F we apply the following force: {Fx = —5N, Fy = 
—5N) for 0 to 1 seconds. 

~ The rest of the object’s vertices are free vertices without any external force 
on them. 

— The material properties of the deformed object are: density (1.8 gjcm^), Pois- 
son’s ratio (0.22), softening coefficient (3), and Young’s module (lOOOGPa). 

— We compute the deformation over 2 seconds. 




Fig. 2. Simulating a deformation in an object meshed at different levels of detail 
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Fig. 3. On the left we have the spline obtained from the simplified deformation. 
On the right we have the spline from the precise deformation 



In order to simplify the presentation of the results we compute the error in only 
part of the boundary. The differences can be seen in Figure 4. 

Two cubic splines obtained from respectively deformed boundaries can be ob- 
served in Figure 3; the value of the differences generated by the simplification 
process is 1.924. 



6 Conclusions 

In this paper we have presented several benefits to applying a finite element 
deformation model : 

— We have calculated the optimal At to apply in the dynamic process. We then 
obtain the consequent reduction of the number of iterations while maintain- 
ing the stability of the deformation model. 

~ We define a precise finite element size to mesh the object in a way that 
generates a finite element mesh that guarantees a good transmission of the 
strain waves caused inside the object by the external forces. At the same 
time we get the precise number of vertices that must be in the object to 
reach a good solution without using superfluous information. We this reduce 
the size of the matrix in the dynamic problem. 

~ If we are using elastic deformation tools in animation work, precision is not 
the main goal, so we have defined a system to compute the error produced 
by a simplified deformation problem with a low computational cost. 

— In future work, we are going to generalize all the topics to 3D. The level of 
detail is more important if we are working with 3D objects. The generaliza- 
tion of the defined function (7) for the surfaces can be done by using their 
first and second fundamental forms. 

Appendix A: Efficient Computing of the Generalized and 
Nonlinear Eigenvalue Problem 

The basic generalized and nonlinear eigenvalue problem is: 

A(j) = uj'^pBcj) , 
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where A and _B are symmetric and is positive definite. We can convert it into 
a equivalent system: 

(j) = ui'^cj) . 

Now we must calculate the inverse matrix of B and this process is computa- 
tionally inefficient. In addition the matrix (^B~^A^ is not symmetric. Thus the 
efficient way to solve this problem is to solve an equivalent system in which the 
eigenvalues are the same as those of the original problem. 

We can recover a symmetric eigenvalue problem using the Cholesky decomposi- 
tion in the mass matrix 



B = L-LA' . 

Multiplying the equation above by L~^, we get 

C {^^4') where C = L~^A . 




Fig. 4. On the left we have the results from the simplified deformation. In the 
right we have the results from the precise deformation. Rendered time steps 
correspond to 0, 0.33, and 1.5 seconds 
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Matrix C is symmetric and its eigenvalues are the same as those of the original 
problem, so we need only calculate the matrix C. To do this we need to compute 
L~^ and we must also make an inverse computation. The efficient way to form 
C is first to solve the equation 

Y -L^ = A , 

for the lower triangle of the matrix Y , and then solve 

L-C = Y , 

and obtain the lower triangle of the symmetric matrix C. 

Once we get matrix C, to obtain its major eigenvalue we can apply the method 
to it. Now we can obtain the period T„. 

In our case the computational cost of the Cholesky decomposition is on the order 
of O(n^), because the mass matrix B in our model is maximum with nine values 
different than null for each row. 
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Abstract. A novel technique is described for analysing human 
movement in outdoor scenes. Following initial detection of the humans 
using active contour models, the contours are then re-represented as 
normalised axis crossover vectors. These vectors are then fed into a neural 
network which determines the typicality of a given human shape, allowing 
for a given human’s motion deformation to he analysed. Experiments are 
described which investigate the success of the technique being presented. 

Keywords: Fluman, Motion Analysis, Shape, Snake, Active Contour, 
Neural Network, Axis Crossover Vector 



1 Introduction 

This paper outlines a meehanism for analysing the motion and deformation of 
walking humans. A method based upon aetive eontour models, snakes [1], and a 
neural network for eategorisation is used [2, 3]. In previous papers we have discussed 
the classification of human shape in isolated static images taken from a motion 
sequence; in this study we discuss an individual’s shape deformation during motion. 
The method discussed is part of a larger system designed to track moving 
pedestrians, a problem that has been the subject of much research [4, 5, 6]. We show 
that the periodic nature of human walking is clearly discernible from the deformation 
pattern, and that individual humans have a specific temporal pattern. 

The paper is divided into 6 sections. Section 2 discusses the use of active 
contour models for detecting walking humans. Section 3 discusses the issues faced 
when using active contour data to train neural networks, and presents a solution in 
the form of the axis crossover vector. Section 4 summarises previous experimental 
findings using neural networks to identify human shapes, which leads into current 
work on human motion analysis using neural networks (section 5), along with 
experimental methods and results. Finally section 6 discusses the findings of this 
paper and details future work in this area. 



H. H. Nagel and F. J. Perales (Eds.): AMDO 2000, LNCS 1899, pp. 48-57, 2000. 
(c) Springer- Verlag Berlin Heidelberg 2000 
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2 Identifying and Tracking Moving Humans Using 

Snakes 

2.1 Snakes 

In order to identify and track human outlines, the basic human shape must first be 
identified, and we use active contour models, ‘snakes’. A snake is an energy 
minimising spline which can detect objects in an image and track non-occluded 
objects in a sequence of images. Snakes can be optimised for detecting and tracking 
particular classes of objects by customising their energy function so that detection of 
the desired characteristics, for example the curvature of an object’s outline, or the 
presence of a particular colour, results in a reduction of the snake’s overall energy. 
This reduction in energy has the effect of attracting the snake towards these desired 
features. Examples of a snake tracking a human can be seen in Figure 1. The active 
contour model used in these experiments was based on the Fast Snake model [7] as 
it allows for more autonomous object detection and tracking than the original model 
[1], although the techniques discussed in this paper are independent of the particular 
active contour model being used. 
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Figure 1: A target human detected and tracked in a sequence of frames, depicted here by 
the results of two discontiguous frames, using an active contour model. The target human 
has been dimmed in this figure to increase the snake’s visibility 

Once the snake’s energy function has been defined, the user clicks an initial 
polygonal contour around the target object in the first frame of a movie. The snake 
then minimises its contour’s energy according to its energy function, causing it to 
lock onto features in the image which are defined as salient by the energy function. 
Once the snake has stopped moving it is copied, in its relaxed position, into the 
next frame of the movie, where it again minimises its energy in an attempt to lock 
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onto the target object in its new location in the image. This process continues until 
the end of the movie is reached. The success of the snake in detecting and tracking 
the target object is largely dependent on the relevance of the snake’s energy function 
to the target class of object; the image quality following image preprocessing; the 
frame rate and speed of movement of the target object; the distance the contour’s 
control points are from each other; and the level of occlusion of the target object. 
Once a snake loses focus from the target object, it rarely regains the successful 
detection and tracking of that target object. 

A more detailed discussion of active contour models and their energy 
functions can be found in [8] and [9]. 

2.2 Tracking Walking Hnmans 

In order to produce a clean and reasonably varied set of data, the 3D modelling and 
animation package ‘Poser’ [10] was used to simulate human movement. Snakes 
were relaxed around simulated humans in 30 different movies, all of which contained 
a single human walking from left to right. Each movie contained a different walk 
style and/or human build, providing a range of simulated human shapes and 
motions. All movies were 120 frames in duration, which allowed for at least 2 paces 
per person, despite individual differences in the length of stride and speed of 
movement from one human to another. 

As described earlier, a snake is initialised around the human in the first 
frame of the movie. In subsequent frames, the snake is allowed to autonomously 
relax around the walking human, using its position in the previous frame as an 
initialisation in the next frame. In such a fashion the human is identified and tracked. 




Figure 2: A target human detected and tracked through a sequence of 120 frames using 
an active contour. The frame numbers are shown below each frame; intermediate frames 
have been omitted. Above each frame is the pose identifier, referred to in the results 
sections of this paper. The target human has been dimmed in this figure to increase the 

snake’s visibility 

Figure 2 shows the results of a snake tracking a sample human over 120 
frames of video footage. No objective measure exists of an active contour’s success 
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at tracking objects [8], but here the model was clearly successful as the humans were 
tracked in all of the 30 different instances. Speed is not a primary concern in this 
work, but it is worth noting that the relaxation times for the snakes in moving from 
one frame to another was often very fast. 

As can be seen from Figure 2, the snake only obtains the outer edge of an 
object; ‘holes’ in between the human’s legs do not form part of the contour. Despite 
the variation that humans adopt as they walk, their outer contour is identified in all 
cases. 

3 Axis Crossover Vectors for Representing Human 

Shapes 

3.1 Representation 




Figure 3: A human contour being re-represented as a 16-axis crossover vector. The 
snake’s contour (left) has 16 axes projected from the contour’s centre to its edges, giving 
16 measurements (top vector). This vector is then normalised (bottom vector), to make it 
scale invariant, providing a compact representation for use in neural networks (visualised 

on right) 

Once the target object has been detected and tracked in a movie, each frame’s 
resultant contour can be stored as a vector of that contour’s control points ((x,y) 
coordinates in the image), with vector length n where n is the number of control 
points along the snake’s spline. However, this vector is not ideal for the purposes of 
shape analysis. By storing the absolute (x,y) locations of the control points, the 
active contour is location dependent; identically shaped contours in different parts of 
the image will have different vectors. The contour’s vector is scale dependent too; 
differently sized contours which share the same shape will consist of different 
vectors. The contour is also rotation dependent; two identically shaped and sized 
contours will appear to have different vectors if they both consider different control 



52 



K. Tabb et al. 



points to be the ‘first’, as it is this eontrol point which appears first in the vector. 
Finally, the length of the vector is determined by the number of control points along 
the contour, therefore identically shaped and sized contours which contain different 
numbers of control points will result in vectors of different lengths. 

All of these factors make the comparison and analysis of active contours 
difficult. For the purposes of being used in neural networks, the native active 
contour vector is additionally unsuitable because it contains pairs of data, the x and y 
coordinates of the control point, requiring the network to associate an x coordinate 
with the correct y coordinate before analysis can take place. 




Figure 4: Axis length deformations over time, taken from a single walking human. The 
graph shows the differences in lengths of 3 of the 16 axes, at 0°, 180° and 205° from 

vertical, during two paces 

One possible solution to the problem of analysing active contours is the 
axis crossover representation, presented in [2] and shown in Figure 3. A number of 
axes are projected outwards from the contour’s centre, with the distance along those 
axes from the centre to the furthest edge the axis meets being stored in a vector. This 
vector is then normalised by its largest element, making the vector scale, location 
and rotation invariant. Furthermore such axis crossover vectors are independent of 
the number of control points on the contour, as the crossover vector’s length is 
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equal to the number of axes being projeeted, not to the number of eontrol points on 
the snake. For more eomplieated situations the axis crossover representation offers 
more flexibility than discussed here; a more detailed description of its features can be 
found in [2]. 

3.2 Analysing Deformation 

Having tracked the object with a snake, the relaxed contours are then converted into 
axis crossover vectors as described in Section 3.1. Figure 4 shows the deformation 
of 3 axes as a human walks along. Note that each axis’ pattern of deformation is 
cyclical, closely repeating itself once per pace. Furthermore some axes are deformed 
more than others. The axis at 0°, which typically measures the distance from the 
contour’s centre to the top of the head, shows little difference in movement during 
the walking motion. This reflects the fairly constant height that the head keeps from 
the centre of the body while walking. Conversely the axis at 180°, which typically 
measures the distance from the contour’s centre to the crotch or feet (in the case 
where the legs are upright and the feet lie below the crotch), varies radically within 
each pace, reflecting the opening and closing of the human’s legs. The length of the 
axis situated at 205° is has negative correlation to the 180° axis, where the opening 
and closing of the leg is again responsible for the axis growing and shrinking. Thus, 
when the leg is outstretched, the 180° axis is short whereas the 205° axis is long, 
and conversely when the human is standing up straight, the 1 80° axis is long and the 
205° axis is short. Both the 180° and 205° axes extend and shrink only once per 
pace, coinciding with the stride being taken. 

4 Identifying Human Shapes with Neural Networks 

A range of neural network experiments have been performed to validate the axis 
crossover’s ability to represent human contours in a scale-, location- and rotation- 
invariant manner [3]. Axis crossover vectors of different sizes were used, ranging 
from 4 axes to 24 axes, to identify the most appropriate axis crossover description 
for human shapes. The number of axes used has a direct relationship with the 
number of input units in the neural network: each axis’ length is the input for one 
input unit. In the double output case one output unit was trained to represent human 
shapes, and the other non-human shapes. A range of hidden layers between 2 and n-1 
units, where n is the number of input units in the network, were tested in order to 
identify the best generalisation ability. 

Training data contained 150 human shapes, and 150 non-human shapes 
which consisted of ‘outdoor furniture’ shapes such as cars and streetlights. Active 
contours were relaxed around these shapes, and were then re -represented as axis 
crossover vectors, which could then be fed into the neural network. 

Test data contained 10 unseen human shapes, and 10 unseen non-human 
shapes, again of outdoor furniture. 

As is summarised in Figure 5, the experiments found that 16 axes offered 
the most suitable level of detail for encoding human contours using axis crossover 
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vectors. Consequently a 16 input unit network was chosen for all further 
experimentation. In addition, by using two units in the network’s output layer, a 
confidence value can be obtained which allows a crude measure of how ‘human’ the 
network considers the given axis crossover vector to be. The confidence value is 
simply the difference of the two output units’ values. 




Figure 5: Results of neural networks classifying human and non-human axis crossover 
vectors. Left: Axis crossover vectors which used 16 axes were found to give the best 
results. Right: The mean confidence value when presented with human shapes (left pair of 
columns) and non-human shapes (right pair of columns) 

5 Human Motion Analysis Using Neural Networks 

As described in the previous section, a trained neural network, when presented with a 
human shape represented as an axis crossover vector, can produce a scalar value 
measuring the confidence of the network that the shape is human. The network used 
in section 4 was tested with 30 different simulated human motion sequences, each 
varying in weight and / or gait. 

The motion patterns are periodic, repeating here after the 60th frame, or at 
the start of the second pace. This can be seen in Figure 6, where two consecutive 
paces are superimposed for each of the 3 humans. The frequency of a human’s 
motion pattern is attuned to the frequency of their walking, so that the speed of 
walking can be identified from the confidence value/time graph. 

A human’s motion pattern forms a signature specific to that human. Whilst 
the motion pattern for a human is not identical from one pace to another, in the 
same way that poses A and E in Figure 2 are not identical, it can be seen from 
Figure 7 that motion patterns for a given human are more similar to each other than 
to other humans’ motion patterns. 
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Figure 6: Analysis of human motion signatures. Each graph shows the neural network’s 
confidence values over time for a given simulated human, with two consecutive paces for 
a corresponding simulated human being superimposed over one another. The graph on the 
left shows the signature for a simulated human with average overall confidence, whilst the 
middle and right graphs show the signatures for the simulated humans with highest and 

lowest overall confidence 




Figure 7: Neural network confidence value plotted against time for three simulated 
humans. 120 F rames corresponds to two paces. The pose identifiers labelled along the 
top of graph correspond to those in Figure 2 

As the human walks, this confidence value will change, reflecting the fact 
that the typicality of the shape of a walking human varies over time. This can be 



56 



K. Tabb et al. 



clearly seen in Figure 7, where for some poses the network classes the humans as 
more non-human than human. See for example humans 2 and 3 in poses B and F. 
Moreover, there is considerable differences between humans so that, for example, 
human 1 is consistently classed as more human than humans 2 and 3 across all 
frames. 

6 Discussion 

In this paper we have shown how snakes can be used to identify human contours in 
an image, and then to track a moving simulated human. These contours can then be 
re-represented as fixed length, scale-, rotation- and location-invariant vectors. When 
the vector for a walking human is analysed over time, the deformation pattern of the 
human contour is observable , clearly showing a differential deformation pattern. 

With a suitable set of images, a training set can be formed from the axis 
crossover vectors for a neural network classifier that identifies shapes as human or 
non-human, and it is the time dependent output of this network that can be used to 
analyse the motion pattern of a single walking human. As shown in Section 6, the 
output of the network is periodic, identifying single paces, so that speed of motion 
can be gauged. Moreover, individual differences between humans are apparent and 
these signatures could be used as identification tags. We are interested in extending 
this work to deal with the problems associated with occlusion. Our aim is to 
produce a model that will combine active contour models and neural networks and 
which will be able to track humans even during differing levels of occlusion, which 
will combine with previous studies on occluded human shapes [2]. This will involve 
prediction of the motion deformation pattern of a given human, which will in turn, 
involve the collection of a wider range of real human data, including footage of 
humans walking towards or away from the viewing point, and of real world noisy 
images. 
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Abstract. The study of the motion of deformable models is one of the 
most important topics in computer animation. The model proposed by 
Terzopoulos et al. [16] uses Lagrange’s equation of motion and a varia- 
tional principle to find the change of shape of deformable objects. We 
present in this work several methods for solving this equation numeri- 
cally, and we study its complexity and stability, in order to find the best 
one. These methods have been classified in explicit and semi-implicit 
methods using central finite-difference methods and non-central finite- 
difference methods. We will see that the explicit methods are better 
than the semi-implicit ones because of lower computational cost in order 
to create the animation of non-highly deformable objects. Moreover, we 
will see that the behavior of non-central finite difference methods are 
better because they are stable. The study has been made for surfaces 
but can be easily generalizable to 3D curves and solids. To illustrate the 
applications we have displayed the animation of a held handkerchief and 
the free fall of a tablecloth over a table. 

Keywords: Physically-based modelling, lagrangian dynamics, surface 
deformation, deformation energy, simulation, finite-difference methods, 
stability, animation. 



1 Introduction 

Deformations play an important role in Computer Graphics, allowing objects to 
assume interesting new shapes. Deformable objects are inherently more difficult 
to model than rigid objects. Since 1986 ([20,6]) , the Computer Graphics com- 
munity has devoted much attention to the problem of modelling and animation 
of deformable objects, such as cloth and flesh, based not only on the Geomet- 
rical/Mathematical intrinsic characteristics of surfaces and solids, but also on 
their material properties in the context of continuum and discrete mechanics . 
Following [9] the deformation-modelling techniques can be classified into three 
categories: geometrical, physical and hybrid. The ^'geometrical techniques" are 
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the best in the computational cost sense, but they don’t consider the physi- 
cal properties of objects. The disadvantage is the poor realism of simulation 
deformation. See, by way of example, [1,20] and [3]. Some of the main goals 
of this activity has been to produce animations of deformable objects with a 
high degree of realism. This approach has been included in a broader modelling 
framework, entitled ’’physically modelling techniques ” , where the objects are 
modeled and moved following the physical laws that govern their static and/or 
dynamic behavior. See [16,17,18,2,14,12] and [11] for more details. The problem 
is that the complexity and computational cost, in almost all the cases, is very 
high. To avoid this drawback the so-called ’’hybrid techniques” have been devel- 
oped. These techniques use concepts and methods from the two previous ones 
to view the deformable objects and their motion in as real a way as possible, 
thus reducing the computational cost and complexity using initial geometrical 
approximations. We can see in [13,19,7,10,5] some examples. An extended study 
of the previous classification can be found in [9] . 



Coming back to the initial problem, the literature on modeling and animation 
of deformable objects shows that physically based techniques should be used 
to build and animate realistic deformable models. Another aspect in realistic 
animation is modeling the behavior of deformable objects. In order to do this we 
should approximate a continuous model by using discretization methods, such 
as finite-difference methods and finite-element methods (see [14], and [11] in the 
latter case). In the finite-difference discretization case, a deformable object is 
approximated by using a control points grid, where the points are allowed to 
move in relation to each other. The way that the points are allowed to move 
determines the properties of the deformable object. Included in these methods 
there are those that try to find the object motion by using Lagrange’s motion 
model and by defining an object functional energy that measures the elastic 
deformation potential energy for the body. This functional can be based on the 
fundamental theorem of surfaces of Differential Geometry, as in [16] and [17] (see 
also [8,7,10]). Our way to solve the deformation problem belongs to this type of 
methods. 



Compared to the former work by Terzopoulos et al. and Giidiikbay and 
Ozgiig ([8]) our work can be viewed as a continuation and verification. Our 
model is similar but differs in the linear algebraic system associated to the nu- 
merical resolution of the differential equation related to the model, to obtain the 
animation of deformable objects. Using these ideas as a guide we present briefly 
in the next section the basic principles of the deformable model equation. In 
Section 3 we describe several finite-difference schemes to solve the deformation 
equation numerically and to obtain the linear algebraic system associated with 
it. To see which is the best way to solve the system we have carried out, in Section 
4, a study of stability and complexity depending on the matrix which represents 
the deformation information. Section 5 is devoted to presenting some simulation 
results and finally we give some conclusions and suggestions for further research. 
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2 Equation of Elastically Deformable Model 



Let be our object. We suppose that J 7 is a surface with local coordinates 
p = {xi,X2) where (x\,X2) € [ 0 , 1 ] x [ 0 , 1 ] (we suppose that the domain of 
parametrization of fl is the unit square). The position of an object point at time t 
will be r(p; t) := r{xi,X2', t) = {ri{xi,X2;t), r2{xi,X2;t),r3{xi,X2;t)). The initial 
position of the object is r°(p) := r'^{x\,X2) = {r^{xi,X2),r2{x\,X2),r'l{x\,X2))- 
The next step is to introduce the deformation energy. To do this, we are going 
to take into account the fundamental theorem of surfaces: two surfaces have the 
same shape if their metric tensor G and curvature tensor B are the same, where 



G^j = G^j{r{xl,X2■,t)) 



dr dv 

dxi dxj ’ 



Bij = Bij{r{xi,X2;t)) = n 



d^r 

dxidxj ’ 



and n is the normal unit surface. See reference [ 4 ] for a detailed discussion of 
these formulations. Using the above tensors we define the deformation energy 
for object 1 ? at time t as: 



e(r) = 



E 

*, 1 = 1 ' 



a 






dxidx2, 



where pu represents the object resistance to length deformation, 7712 = 7721 rep- 
resents the object resistance to shear deformation, represents the object re- 
sistance to bend deformation and ^12 = ^21 represents the object resistance to 
twist deformation. 

The motion equation for a deformable model can be written in Lagrangian 
form (see [ 15 ]) using the previous functional as 



d f dr \ dr 

di V ^^~dt 



fe(r) 

5 v 



/(rU), 



( 1 ) 



where is the first variational derivative for the deformation functional and 
/(r,t) is the net externally applied force. The first term is the inertial force 
due to the mass distributions. The second term is the damping force due to 
dissipation, and the third term is the elastic force due to deformation of the 
object away from its natural shape. 

To create animations with this deformable model, the differential motion 
equation should be discretized by applying finite-difference approximation meth- 
ods and solving the system of linked ordinary differential equations of motion 
obtained in this way. But, first of all, it is necessary to simplify the first varia- 
tional derivative of functional e(r). Concerning the approximation of it can 
be seen in [ 16 ] that 



fe(r) ^ d f 5r \ f S ^ 

i5r dxi \ dxj ) dxidxj \ dxidxj ) ’ 



( 2 ) 



where = 77^(77; t) {Gij - G° ) and = ^^(p; t) {Bij - B°j) . 
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3 Numerical Resolution of Deformable Model 

Equation (1), which represents the surface movement and its deformation at the 
same time, can be solved using finite differences in both spatial and time coordi- 
nates. In order to do that we approximate spatial derivatives in order to trans- 
form the partial differential equation (1) into an ordinary differential equation. 
So, using any of the finite-difference schemes described below, expressing the grid 
function r(m, n) as the MN dimensional vector r in the grid vector notation, 
we transform the partial differential equation (1) into the following differential 
equation: 

+ + = <=*> 

where M is the MN x MN diagonal mass matrix, C is the MN x MN diagonal 
damping matrix and K(r) is the MN x MN stiffness matrix which contains 
all the information about the finite spatial discretization and the deformation 
properties of the object. Note that M is the mass matrix with ^{m,n) as di- 
agonal elements, and C is also constructed similarly from 7 (m, n). As we can 
see below, the matrix K(r) is the most important in the numerical resolution of 
equation (3). 

To solve the differential equation, we use the following discrete time approx- 
imations: 

dr ^ —L{t — d'^r ^ r{t + At) — 2r (t) +r{t — At) 

dt 2At ’ dt"^ At"^ 

Independently of the of finite-difference scheme used to discretize the spatial 
coordinates, to solve equation (3) we have to solve the following linear system 
of equations, associated to an explicit integration procedure for the ordinary 
differential equation: 



A ■ r{t + At) = h{t), (4) 

where A = 2 §t and b (t) = f (t)- (K(r (t)) - ^)r{t)+{^^ - ^)r{t- 

At) . The previous system is easy to solve because matrix A is diagonal. We can 
write: 



r (t -I- At) = A ^ • b (t) := S • b (t) , 

where Sij = 0, if f yf j and Su = -I- ^ ■ Then, the original nonlinear 

partial differential equation (1) has been reduced to a sequence of diagonal linear 
algebraic systems done by (4). 

This fact is different to the numerical integration through time carried out 
by Terzopoulos et al. As we can see in [16,17] (also in [8] ) they obtain a semi- 
implicit integration procedure 

A{t) -r{t + At) =g^. 



( 5 ) 
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where matrix A (i) = K(r(t))+(^ + 2§j) andg^= fj+^-r W + - ^)- 

r (t — At). 

To solve equation (1) numerically we use the standard finite-difference ap- 
proximation operators. In the first place we use a scheme based on central Bnite- 
difFerences, where the difference operators are done by 



Ml ~ h(ai-|-Zia:i,X2)-h(ai-Axi,X2) • Ml 

dxi ~ 2Axi ’ ^ Similar way loi , 

9=h _ h(x, + Ax,,X2)-2h(2.p)+h(x,-Ax,,X2) ^ ^ 

— h(xi — Axi ^X2-\- ^X2)-\-h.{x\-\- Ax\,X 2-\- ^X2)-\-h.{x\ — Ax\ ^X2) — h.{x\-\-Ax\Ax\,X2) 
dx\dx2 ~ Ax\Ax2 



dx\ 

d^h 



and so on up to fourth order derivatives. In this case the approximation error 
order is 2. Next, we use a non central Rnite- difference scheme where the approx- 
imation error order is I. This method was used in [16] and [17] (see also [8]). We 
define the following forward and backward first difference operators: 



D+\i{xi,X2) = ^ D^\v{xi,X2) = ^ 

D+h(xi,X2) = Z72'M^1,^2) = 



Using the previous difference operators we define the second difference operator 

by 



D^2^{xi,X2) = £>Jih(xi,a;2) = {D^h.{xi,X2)), 

Diih{xi,X2) = D^{D^\i{xi,X2)), 

D^^h.{xi,X2) = £>2ib(xi,a;2) = T>]'(i72'h(a:i, 2 : 2 )), 
L>22h(a;i,ai2) = £>^(D^h(a;i, X2)). 



Now, using a grid of points to represent the continuous surface (where we 
apply the deformation) we can discretize the expressions involved in equation (1) 
using the approximate variational derivate done by (2). 

Another aspect to take into account in the discretization process of equa- 
tion (1) is the surface boundary condition. We have considered two different 
boundary conditions. First of all we introduce a free (natural) boundary condi- 
tion on the edges of a surface where the difference operator attempts to access 
nodal variables outside the discrete domain, setting the value of the difference 
operators to zero. This is equivalent to moving the boundary so as to amplify the 
square domain [0,1] x [0,1] and turn the values r(a;i,a: 2 ) of the old boundary 
into the r(a;i,a; 2 ) values of new boundary. After, we consider regions of dis- 
cretization in the boundary. We make this clear with the first and second order 
finite-differences. In this case, we divide the square domain [0,1] x [0,1] into 
nine regions, as can be seen in Figure 1. In each of these regions we construct a 
new set of difference operators. For example, in zone VIII we have the following 
expressions for the approximations for the difference operators: 
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ah 

dxi 

ah 

dx2 

a^h 

a^h 

dxl 

a^h 

dx\dx2 



3h(xi,a;2) - 4h(a;i - Axi,X2) + h(xi - 2Axi,X2) 
2Ax\ 

h.{x\,X2 + Ax 2) — h(a;i, a;2 — AX2) 



2Ax2 



1 



^^(2h(xi,X2) - 5h(a:i - Axi,X 2 ) + 4h(xi - 2Z\a;i,X2) 
h(a:i - 'iAxi,X2)) 

h(xi,X2 + AX 2 ) - 2h.{xi,X2) + h(xi,2;2 + AX 2 ) 

Axl 

■ . ^ , — (h(a;i,a; 2 ) - 2h(a;i - Axi,X 2 ) + 3h(xi - Z\xi,a :2 
Al/\Xi/aX2 



Ax2) 



- h(xi - Axi,X2 + Ax2) - 2 h.{xi,X 2 - AX2) + h.{x\,X2 + AX2) 

+ h(a:i — 22ia;i, a; 2 ) — h(xi — 2Z\xi, X 2 — AX 2 )) 

and so on with all the regions and each of the derivatives appearing in equa- 
tion (1). In the case of third and fourth order derivatives we have twenty- five 
different regions. 




Fig. 1. Regions of discretization of boundary in order to compute the finite- 
difference operators 



So, we have four ways of solving the partial differential equation (1): (i) 
central finite-differences with free natural boundary, (ii) central finite-differences 
with a regions boundary, (iii) non central finite-differences with a free natural 
boundary and (iv) non central finite-differences with a regions boundary. 

To find a numerical approximation of expression (2), we can write this ap- 
proximation as: 

2 

e(xi,X2)= ^ -D^{p^j){xi,X2) + Dy{q^j){xi,X2), 
i,i=l 



( 6 ) 
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where Pij(xi,X 2 ) = aij{xi,X 2 ) ■ Djv{xi,X 2 ) and q_ij{xi,X 2 ) = Pij{xx,X 2 ) ■ 
Dij{v){xi,X 2 ) and Di, Dj and Dij are the finite-difference operators. 

As we said above, using these four methods, we transform equation (1) into 
the ordinary differential equation (3), and using the discrete time derivative 
approximations we can integrate this equation through time using a step-by- 
step procedure, solving the diagonal linear algebraic system done by (4). We 
can evolve the dynamic solution, which is a simulation of surface deformation, 
from given initial conditions at t = 0 by solving the linear algebraic system 
for the instantaneous configuration r(t -|- At) using preceding solutions r(t) and 
r(t — At). 

4 Study of Stability and Complexity 

To see which is the best way of solving the system (4) we need to study the 
system stability depending on the matrix K(r (t)) we are using. 

First of all, we introduce the following notation: 

r„ := r (nZ\f), f„ := f {nAt), B{v (t)) := - K(r (t))^ , 

. In order to study the stability and taking into 

account that rg = r_i, we have to find a matrix G„ and a vector g„ such 
that: r„ = G„ro -I- g„. So, using equation (4), we find the following recurrence 
relationships between matrices and vectors : 

G„ = B(r„_i)G„_i -|- DG„_2, gn = a ^f„_i -|- B(r„_i)g„_i -|- Tlg„_2. 

The system stability is given by the spectral radius p(G„) of matrices G„. That 
is, the system is stable iff lim p(G„) = L yf oo. 

n — KXD 

We have carried out two kinds of experiments and in the following two figures 
we show the spectral radius of matrix Gn in each iteration and experiment. 
In a first place we have deformed a plane with a fixed boundary, displayed in 
Figure 2, then we have deformed the same plane with only the four corners 
fixed, displayed in Figure 3. That is, in the first case we take the initial state 
r°(xi,X2) = (xi,X2,0), with {xi,X 2 ) G [0,1] x [0,1], and we deform the object 
r° with r(a;i,0;t) = (a;i,0,0), r(0,a;2;t) = (0, 2:2,0), r(a;i,l;t) = (xi,l,0) and 
r(l,a;2;t) = (1, 2:2,0) for all t € [0,T]. In the second experiment we deform r'’ 
with r(0, 0; t) = (0, 0, 0), r(0, 1; t) = (0, 1, 0), r(l, 0; t) = (1, 0, 0) and r(l, 1; t) = 
(1, 1, 0) for all t G [0, T], We have simulated an animation for a total time, T, of 
2 seconds with 0.01 time increment. The plane is discretized into M rows by M 
columns (so, if the plane is a unit square and Mis 11, the distance between points 
is 0.1), using a mass and damping density of 1 at each point. The horizontal, 
shear and vertical deformation resistance is fixed at 1 in the case of experiments 
displayed in Figure 3 and at 10 in the experiments of Figure 2. In this figure 
we show the evolution of the spectral radius in three different cases, varying 



and D := A 



/ G M 

I ^ ~ 'Afi 
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Case (iii) 



Case (iv) 



distance 
between 
points 
0.2 



• 0.0666667 



Fig. 2. Graphics of spectral radius in the experiments where we fix the boundary. 
In the horizontal axis, we represent the iteration n, and, in the vertical one, 
the spectral radius of matrix Gn- In each graph we have overlapped the results 
achieved in three different experiments, each one with different distance between 
points 



the plane discretization (M = {6, 11, 16}) in order to study the evolution with 
different distances between points. 

From Figures 2 and 3 we deduce that only with case (iii), that is, the explicit 
method using non-central-differences with a free natural boundary condition, we 
can guarantee the stability of the numerical resolution because limp(G„) = k. 

n 

The other cases aren’t stable because (see Figures 2 and 3) limp(G„) = oo 

n 

and the convergence is very fast. The reason for this is the relevant importance 
played by point {x\,X 2 ',t) where we compute the derivative approximations to 
obtain matrix K. Except in case (iv) of Figure 3 where after some oscillations 
we obtain a similar behavior to case (iii) of Figure 2. 



4.1 Computational Cost of Model Behavior 

We now consider the problem of obtaining the computational cost of our model 
behavior, done by equation (4) in comparison with the development of Terzopou- 
los’ system, which we can see in equation (5). More precisely, we consider the 
study related to the simulation part in the case of a surface discretized in M 
rows and N columns, that is M x N points on the plane. 

Previous to the simulation step it is necessary to construct the vectors and 
matrices which are necessary to solve the ordinary differential equation numer- 
ically. As the computational cost for each of the vectors and/or matrices is 
0(M X N), then the total computational cost in this part is also 0{M x N). 
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Fig. 3 . Graphics of spectral radius in the experiments where we only fix the four 
corners. In the horizontal axis, we represent the iteration n, and, in the vertical 
one, the spectral radius of matrix 



Recall that in each case the related equations are done by 

( + ^) £ (^ + ^^) = - (K(r (t)) - I (t) + ( - ^) r (t - At) 

(^ + 23t + =fr + (2 M^)iW + Lit -At) 

that briefly we express as: 

Di ■r{t + At) = f,, — Al • r(t) + D2 • r (t — At) 



Ti -r{t + At) = + A2 • r (f) + D2 • r (t - At) 



Now we are going to analyze the computational cost, in time, of the sim- 
ulation part in our linear algebraic system. To obtain matrix K(r (t)) we first 
need to compute the expressions involved in equation (2), which is of 0(M x N) 
order, and after that, to build the precise matrix that is also of 0{M x N) or- 
der, because we don’t fill the whole matrix, but rather only the points that are 
involved in the deformation and movement of each point of the discrete surface. 
Finally, we obtain an 0{M x N) order for this step. 

As matrices Di and D2 are diagonal we use an 0{M x N) time order to 
compute them, which is, also the time necessary to compute the Ai matrix, 
because it is the difference between K(r (t)) and (a diagonal matrix). 

On the other hand, the product of a vector by a matrix requires, in general, 
0((M X A)^) time computing. So, this is the time necessary to compute Ai -r (t). 
But, to compute D 2 -r (t — Z\t) we need only 0(M x N) time computing, because 
D 2 is diagonal. Then, we have a 0((M x A)^) computational cost to obtain 
f — Al • r (t) -f D 2 • r (t — At) . 
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Finally we need to solve the diagonal linear algebraic system 



Di • r (t + At) = — Ai • r (t) + D 2 • r (t — At). 

Since Di is diagonal we can compute the inverse Di~^ in linear time and at the 
same time we can make the product by f,. — Ai • r{t) + D 2 ■ r{t — At), so the 
computational cost of the whole process if 0{{M x A)^). 

In Terzopoulos’ case we need to compute matrices A 2 and D 2 , as in the 
previous case, with an order 0{M x N), and we need the same time computing 
to obtain the Ti matrix. 

To solve the linear algebraic system 



Ti • r (f + At) = f,, + A2 • r (t) + D2 • r (t - At), 



we need to compute the vector + A 2 ■ r (t) + D 2 • r (t — At) and in this case we 
use a time computing of order 0{M x N) because A 2 is diagonal. The matrix 
Ti isn’t a diagonal matrix and depends on K(r(t)). So, we must invert it in 
each step of integration. The time required to do it is very expensive. Thus, it 
is necessary to use a decomposition method to obtain Ti“^, so that the total 
computational cost order is 0{{M x N)^). 

Then, the explicit procedure to integrate through time the ordinary dif- 
ferential equation, represented by equation (4), is more efficient, in the time 
computing sense, than the semi-implicit integration procedure, represented by 
equation (5). 



5 Simulation Examples 

An implementation of those models has been made using the C-|— I- language over 
several hardware platforms, also using Openinventor®, a 3D modeling toolkit 
which simplifies visualization and scene composition tasks. 

The following two figures show artistic animation sequences of deformable 
objects, the animations have been obtained with the explicit method using the 
central finite-difference scheme with a free natural boundary condition. 

The first one, shown in Figure 4 represents a handkerchief held by its four 
corners in a gravity field. After two seconds, one of the corners is released, so 
this point is affected by the gravity field too. The Figure caption shows the 
experiment physical parameters. Note that the cloth discretization is a mesh of 
21 rows by 21 columns. 

The second one. Figure 5, represents a tablecloth falling over a table. The 
mesh discretization is the same as for the previous simulation. The collision 
between the tablecloth and the table hasn’t been simulated, this is a goal of 
future work. The simulation parameters are shown in the Figure caption. 

Notice the different evolution of the borders in the two experiments, the first 
one (Figure 4) is more elastic than the second (Figure 5). 
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5.4 s 



6.0 s 



6.6 s 




7.2 s 8.0 s 10.0 s 

Fig. 4. A simulation example, the handkerchief held. A cloth held by its four 
corners, after 2 s. one corner is released. We can see the evolution over 10 s., the 
physical parameters of the cloth are 7711 = 1 , 7712 = 7721 = 1 , 7722 = 1 , = OVz, j, 

the mass and damping density equals 1 at each point of the discrete mesh, and 
the simulation time-step is 0.01 s 
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1.28 s 1.6 s 



4.0 s 




4.8 s 




0.72 s 




2.4 s 




6.0 s 



1.0 s 




3.0 s 



10.0 s 



Fig. 5. A simulation example, the restaurant table. A tablecloth drops over a 
table. The total simulation time of 10 s. has been divided into time-steps of 
0.01 s., the physical parameters of the cloth are 7711 = 4, 7712 = 7721 = 4, 7722 = 4, 
Ci.i = the mass and damping density equals 1 at each point of the discrete 

mesh. The collision effect has been made by setting fixed points at the middle 
mesh after 0.5 s. of simulation time 



6 Summary and Conclusions 

Physically based modeling is necessary if we want to create realistic animations. 
Physically based models add new characteristics to object description in addition 
to geometric descriptors, which are used to control the creation and development 
of the models. To construct the model and the differential equation associated 
with it, different techniques can be used, such as Lagrange equations of motion 
using an internal elastic force, which resists deformation, based on differential 
geometry. After constructing the equations of motion and deformation for the 
models, the equations should be solved as fast as possible using numerical meth- 
ods. 

In this paper we recover and explain a system for animating deformable 
models using a physically based approach. We analyze how to solve the partial 
differential equation using different finite-difference operators, in order to study 
the movement and deformation of deformable objects. We have studied the de- 
velopment of four (spatial) finite-difference resolution schemes depending on the 
error estimation and boundary treatment and, taking into account the advan- 
tages and disadvantages of each formulation. To make the integration through 
time we have used an explicit method that reduces the computational cost from 
0{{M X NY) to 0{{M X NY) order, and to see which is possibly the best we 
have analyzed the stability for each one showing numerical examples. The system 
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uses all the formulations for animating the models so that the user can decide 
which one to use in an animation and then compare the results. As can be seen 
from the section related to the stability we deduce that the explicit method with 
the differential scheme from Terzopoulos et al. is the best one at this moment. 
Obviously, a finite-difference scheme with 0{h?) approximation order is directly 
dependent on the point involved in it, and we haven’t studied all of them, only 
the most appropriate. That is, the schemes using point neighboring to the point 
where we compute the derivative. 

Another aspect that we have in mind is to generalize the study made in this 
paper for surfaces to curves and solids. Moreover, as physically deformable mod- 
els are active, that is, they respond to external influences and interact with other 
objects, it is very interesting to study the response of the model in the presence 
of collisions, in order to incorporate deformable objects in a more complicated 
3D scene. Moreover, we can improve the system if the collision of objects with 
themselves are considered, so that interpenetrating does not appear in the de- 
formation simulation. Another extension may be the study of possible system 
parallelization in order to obtain shorter time responses. 

At the present, we are working on the discretization of equation (2) by using 
a different strategy from equation (6); developing the expressions involved in 
equation (2) in order to obtain a new expression for use in the numerical ap- 
proach. So, if we suppose that rjij is constant, in the case of the first term we 
take 



5r 




and a similar expression for the second term. Next, we apply finite-differential 
operators in order to obtain the corresponding expression (6) for the first term: 



e{xi,X2) = - J2 -Vij ■ Dj{r) + Di(r) ■ Dij{r)) Dj(r) + 

(A(r)-D,(r)-G«.)Aj(r)), 

and an equivalent expression for the second term. The advantages of this new 
method are the following: we maintain the same order of approximation for 
methods (i) and (ii) but in this case we need fewer neighborhood points near the 
central point {xi,X 2 )- So, we expect better results. 
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Abstract. This paper presents a shape and pose estimation method 
for 3D multi-part objects, the purpose of which is to easily map objects 
from the real world into virtual environments. In general, complex 3D 
multi-part objects cause undesired self-occlusion and non-rigid motion. 
To deal with the problem, we assume the following constraints: 

— object model is represented in a tree structure consisting of de- 
formable parts. 

— connected parts are articulated at one point (called “articulation 
point”). 

— as a 3D parametric model of the parts, we employ deformable su- 
perquadrics (we call DSQ). 

To estimate the parameters from the sensory data, we use time model- 
space gradient method, which reduces the parameter estimation problem 
into solving a simultaneous linear equation. We have demonstrated that 
our system works well for multiple-part objects using real image data. 



1 Introduction 

Virtual environment applications such as augmented reality system, man-ma- 
chine seamless interaction, video game console and tele-operation interface re- 
quires the system to estimate scene parameters for the natural objects such as 
human bodies. To realize this requirement, we have been developing vision-based 
3-D multi-part object tracking system which reconstructs the time- varying scene 
parameters of objects including motion, shape deformation and dynamic surface 
texture. In order to solve such a high DOF(Degrees of Freedom) estimation prob- 
lem, we must consider how to reduce the computational cost, and how to increase 
the estimation accuracy. Here, we assume that the objects can be represented 
in a multi-part parameterized and constrained model which is articulated and 
hierarchically structured. Based on this assumption we can reduce the parameter 
search cost. 

Our system acquires the scene parameters of the objects by two steps (Fig.l). 
First, with user’s assist, an initial model is constructed from an initial frame 
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data. To construct the initial model easily, we have developed a GUI based 
3D modeler[l] which can obtain shape and pose parameters of an object semi- 
automatically. And then, the model is tracked in the succeeding frames. This 
paper presents the tracking method, or, estimation of model parameters, in the 
succeeding frames. 

Because the appearance of multi-part objects such as humans and animals 
changes considerably according to their shape deformation, their motion and 
viewing angle, it becomes very difficult to estimate their parameters with high 
DOF. We cannot assume ’rigid’ shape model, and moreover, in practice, it is 
impossible to know the precise appearance-based object model, i.e., a set of “in- 
stance” object models, in advance. Rather we need a class object model (FISHER 
et aZ.,I997[4]; BIEDERMAN,I987[5]) which can be deformable, and the number 
of parameters required to represent the model should be as small as possible 
to make their estimation easier. Therefore, we have adopted Deformable Su- 
perquadrics model (METAXAS, et aZ.[6]) as a primitive of geometric model de- 
scription(see Sec. 3). In addition, we have employed a multiple camera system 
to relieve the difficulty caused by self-occlusion among the parts of the object. 

In this paper, we briefly show our method for generic 3-D multi-part object 
tracking and demonstrate that the implementation system works well for multi- 
ple viewpoint image sequences in which a multi-part object causes self-occlusion. 




Display 



Fig. 1. System overview 
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2 Modeling of the Multi-part Object 

In our system, we consider natural objects such as humans and animals. They 
are multi-part non-rigid objects which have strong constraints of joints between 
the parts. Here, we assume structural constraints as follows: 

~ an object model is represented in a tree structure consisting of deformable 
parts. 

~ connected parts are articulated at one point (called “articulation point” ) . 

~ as a 3D parametric model of the parts, we consider deformable superquadrics 
[6] (we call DSQ). 



2.1 DSQ Geometry 

When (?7,w) is a material coordinate, a point on SQ,{superellipsoids) e is: 



/ei(?7,w)\ f -CJ\ 

e{r],uj) = 62(77,0;) = a 02 ■ , 

\ 63(77,0;)/ \ 03-5'^^ / 

where 

= sign(cos77;)|cos77;/,5'()j = sign(sin77;)| sin w/, 

and a, ai, 02, as are scaling parameters, ei,C2 are squareness parameters. 
Using SQ e, a point on DSQ s is expressed as follows: 



s = 




/ Ui 63 , , , ,63 -I- &2 

( 1- l)6i -I- Oi COS( 

aa.3 0-0.3 

^263 

( + 1)62 

aas 



nbs) ^ 



V 



63 



( 1 ) 



( 2 ) 



where ti,t2 are tapering parameters, and 61,62,63 are bending parameters. 



2.2 Multi-part Model Geometry 

An object model is represented in a tree structure consisting of the 3D deformable 
parts. Fig. 2 illustrates the multi-part model geometry. We use the word level to 
represent the depth of each part from the root node. Let p^. denote the position 
of a point with respect to the 4 >i, the local coordinate system of a part rii. 
When the level of a node should be represented, we use an additional suffix like 
which means the coordinate system of a part rii, whose level is k. Then, p^, 
the position of a point with respect to the world coordinate system can be 
expressed as follows: 

~ In the case of the root part, or fc = 0 

I T> ^ 

- *0-0 +^0ioP'^“o’ 



where 



( 3 ) 
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• t^. = {xig,yig, Zig)’^ Is thc positioii of the origin Oig of cj>ig with respect 
to the world coordinate system 

{uig) is the rotation matrix of 4>ig with respect 
to where Rj;(w), Rj,(u)R^(t(;) are the rotation matrix around X,Y and 
Z axes. 



/I 
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0 \ , 


( cos w 
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sin V \ 


R;c(m) = 0 


cos u 


— sinw , Ry(w) = 
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( cosw 
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~ In the case of /c > 0 






(4) 



Ps> — +R0’°(jii H (t^.^ ^-fR^.^ MJifc +P0ifc)) ■ • •))! (5) 



where 

• is the position of the origin Ot- of (j)i. with 
respect to the coordinate system of its parent part 

• is the rotation matrix of (j>i^ 
with respect to 

• Sy is the center position of DSQ Ui^ with respect to 4>ij , 

• jif, is the position of the model center of part nt^, with respect to (jji^ . 



Under the above geometry, establishing the world coordinate we translate it 
into the image coordinate {Um, Vm)^ by Tsai’s camera model[ll]. 

For parameter estimation, we do not estimate translation parameters t or J 
because we assume constraints of joints between the parts. Translation parame- 
ters depend on only shape parameters and we calculate the translation param- 
eters according to shape parameters. 



2.3 Multi-part Object Model Description 

From the definitions mentioned above, the model parameters of each part can be 
expressed as: 



shape parameters: (a, ai, 02, 03, ei, C2, ^1,^2, bnh, bs) 
pose parameters: {x,y, z,u,v,w, jj:, jyjz) 

where shape parameters are DSQ parameters for each part in Eqs.(l),(2) and 
pose parameters are translation and rotation parameters in Eq.(3)(in case it is 
the root part) , or Eq.(4)(in case it is a non-root part). 
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root part(level 0) 




3 Multi-part Object Tracking 

3.1 Tracking System 

An outline of the 3D multi-part object tracking is as follows: 

1. Acquire an initial model, referring to initial multi- viewpoint frames. 

2. Acquire model sample points from the previous frame for every viewpoint. 

3. Estimate model parameters for each part in accordance with the estimation 
strategy. 

4. Refine the model parameters in accordance with the result of #3. 

5. Repeat #2-^4 in the succeeding multiple-viewpoint frames. 

3.2 Tool for Initial Modeling 

The aims of our modeling tool is to construct a structure of an object model, and 
to estimate the initial model parameters, where the modeler fits the model view 
to initial multi-viewpoint image frames. In order to achieve them, our modeler 
have the following functions: 

— multi-view overlay support of initial image frames 

~ virtual camera set up in the configuration in which real cameras are set 
~ interface of model parameter control 

Fig. 3 shows typical display of the modeler. We can control scale, shape and 
pose parameters for each part model so as to fit the model views to all the overlaid 
multi-view frames. We can also fit the part models semi-automatically, using 
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parameter estimation process.^ In this modeler, a user constructs the object 
model, whose parameters match the initial frames, from the root part in the 
structural order. 




Fig. 3. GUI-based Initial Modeler 



3.3 Acquiring Model Sample Points 

In our methods, we use points with high textureness on model surfaces for track- 
ing. Since we do not use the contours but points on the surfaces of the model 
for tracking, we can ignore the influence of complex background. It is difficult 
to extract its contour with high reliability, especially when the background of a 
part includes other parts. The procedure to acquire sample points on a model 
surface is as follows: 

1. make projections of all model parts using a Z-buffer algorithm. 

2. extract image features from the previous frame in accordance with eigen- 
values [7] 

3. for each extracted feature point, obtain its corresponding material coordi- 
nates by finding the nearest vertex of a polyhedron approximating the model 
in DSQ. If a normal vector of the model surface on a sample point and the 

^ First, a user crops the image region of each part, and extracts the contour 
points. Then our modeler fits them to the part model using parameter esti- 
mation. 
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optical axis of the camera are nearly perpendicular to each other, the sample 
point will not be used for tracking because it is considered that the error of 
its calculated material coordinate is large. And, if a sample point is near to 
the boundary between the parts on image, the sample point will not be used 
for tracking because the point may be influenced by occlusion. 



3.4 Estimation of Model Parameters 

Estimation Strategy When we acquire sample points on the model surface, 
we can estimate the model parameters based on the time model-space gradients. 
In the proposed method, to reduce computation time we use a simple top-down 
strategy under the structural constraints, that is, estimation of the parameters 
of each part recursively according to the tree structure of the object. 

estimation strategy: 

1. Set i = root. 

2. Estimate the parameters for the parti{see 3.4.2). 

3. If it has descendant parts, then estimate their parameters recursively 
(goto 2). 

4. If not, stop the recursion. 



Estimation of Model Parameters Our system adopts an estimation method 
based on time model-space gradients which can reduce computation time. For 
tracking, we assume that the brightness (an appearance in an image) of a point 
on a model surface does not change between adjacent frames. Let Ij{p,t) be 
the brightness of point j on the model surface whose model parameters are 
p = {pi,p 2 , ■■.,Pn) at time t. If model parameters are p = {pi,p 2 , ...,Pn) at 
time t and model parameters are p + Ap at time t-\- At, we obtain the following 
equation. 

Ij{p-\- Ap,t-\- At) = Ij{p,t). (6) 

Here, let Vj(p) be the image coordinate (xj,yj) of the point j on the model 
surface whose model parameters are p. We rewrite Eq.(6) in the form 



I{vj{p + Ap),t + At) = I{vj{p),t). (7) 



where the function /(v,t) is the brightness at the image point v = (x,g) at 
time t. If the brightness varies smoothly on the model surface, the brightness 
varies smoothly with x, y, and t. Then, by expanding the left-hand side of the 
Eq.(7) in a Taylor series and ignoring second- or higher-order terms, we obtain 
the following equation: 



Api 



/ dl dx 
\ dx dpi 
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We obtain Eq.(9) by simplifying Eq.(8). 



^91 ^ dl dl ^ 

^Pi t ; !-■•• + Apn t: ^ ~ ^ 

opi opn ot 



(9) 



Eq.(9) consists of model parameter changes and image gradients with respect 
to time and model space. We call Eq.(9) time model-space gradient constraint 
equation. 

This single constraint is not sufficient to determine the parameters. It is 
clear that if sample points belong to the same model, their model parameters 
are common. Therefore, we collect Eq.(9)s of sample points which are on the 
same model surface and obtain a simultaneous equation as follows: 

A/-x = b/, (10) 

where 
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and k is the number of sample points. Ideally, it is sufficient to determine the 
model parameters when fc = n. In fact, to reduce the influence of noise, we use 
more sample points and apply the least square method to the equation, that is, 
we estimate x which minimize 



|Ar-x-b/p. (12) 

As mentioned above, our estimation method finally becomes a problem of 
liner system. Thus this method takes lower computation cost than our previous 
methods based on analysis by image synthesis[8][9] which have to be solved by a 
non- liner optimization method. However, since this method makes the assump- 
tion that the brightness on a model surface do not change, our method has a 
drawback that it is easily influenced by noise of observation. To overcome this 
problem we apply a smoothing filter to input images. 

4 Experimental Result 

4.1 Shape and Pose Estimation 

For this experiment we prepared image sequences of moving upper half of the hu- 
man body using six cameras. Multiple viewpoint image sequences are obtained 
by our capture system[10] with PC-Cluster. Multiple cameras are calibrated 
by Tsai’s method[ll]. First, we constructed the model which consists of four 
parts(body, right arm, left arm and head) and three joints. Then, we estimated 
pose and shape parameters over 120 frames. Fig. 4 shows several input images and 
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projected images of the estimated model. In this case, about 100~200 feature 
points are detected in each part and used to estimate the parameters. Although 
the object includes self-occlusion and texture of both arms are very similar to 
each other, these part models are tracked successfully on whole multiview se- 
quences. 

Computing time of the analysis using a 500MHz Pentiumlll PC is about 
1 sec for 1 frame of each view, and only 10 msec is required to minimize the 
formula 12^. Our proposed method can track the object 50-100 times faster than 
our previous method, which have employed a non-linear optimization technique. 

4.2 Effect of the Number of Viewpoints 

Next, to measure the effect of the number of viewpoints, we tracked a human in 
the samce scene as used in 4.1 with 2, 3 and 6 cameras and made a comparison 
among their estimation results. Fig. 5 shows projected images of estimated model 
of each case at the last frame of image sequence. When we used 3 cameras, 
the estimation of body was more accurate than when used 2 cameras although 
the right hand was not correctly tracked. And when we used 6 cameras, the 
estimation of both arms was more accurate than when used 3 cameras. Up to 6 
cameras, we have confirmed that estimation is more accurate as the number of 
viewpoints increases. 

However, we have not yet investigated the effect of the increase of viewpoints 
in detail, especially how to arrange the viewpoints to get the accurate estimation. 
When a moving object is observed, optimal solution may not be acquired in 
advance. Rather it is necessary to control the viewpoints dynamically with active 
cameras or to select good positions to observe from a lot of cameras. We will 
start such directions. 

5 Conclusion 

To estimate shape and pose information for objects in time varying image se- 
quences, we have adopted two kinds of information: one is a a priori object 
model, which consists of multiple-deformable parts; the other is an a posteri- 
ori observation, which means recovering 3D shape and pose parameters from 
multiple-viewpoint image sequences. The advantages of our approach is that 
non-rigid multi-part objects can be handled, and that, especially, self occlusion 
can be dealt with by multiple cameras. To estimate the model parameters we 
have used time model-space gradient method, which can drastically reduce the 
computational cost. 

The future works are summarized as follows: 

— Selection of adequate viewpoints when the number of viewpoints increases, 
which can reduce the computation time and can get more accurate parameter 
estimation. 

^ We can achieve faster calculation with an appropriate parallel processing 
scheme. 
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— Development of a real-time system, using parallel processing techniques es- 
pecially on our PC-cluster[12]. 

— Introduction of more complex 3D model to represent natural objects with 
higher reality. 

Then we will develop a system to map efficiently real world objects into a virtual 
environment for various applications. 
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Abstract. A temporal modelling and prediction scheme based on mod- 
elling a ‘history space’ using Gaussian mixture models is presented. A 
point in this space represents an abstraction of a complete object his- 
tory as opposed to finite histories used in Markov methods. It is shown 
how this ‘History Space Classifier’ may be incorporated into an existing 
scheme for spatial object modelling and tracking to improve tracking 
speed and robustness and to classify object ‘behaviour’ into normal and 
abnormal. An application to the tracking and monitoring of livestock is 
also presented in this paper. 

Keywords: Behaviour Model, History Space Classifier, Livestock 



1 Introduction 

In recent years there has been much interest in object tracking [6,1,15,14,10,11] 
and temporal modelling [7,13,16,3]. The combination of object tracking and tem- 
poral modelling gives rise to many exciting application possibilities. Isard and 
Blake [6] use a temporal model to improve the speed and robustness of their 
object tracker. Wren and Pentland [16] and Davis and Bobick [3] use tempo- 
ral models to classify observed human movements and in the case of the latter 
use this information to trigger interactive responses in a virtual environment. 
Johnson et. al. build a joint behaviour model in which a virtual human reacts 
in a realistic manner to observed behaviour in a limited domain. Sumpter and 
Bulpitt [13] use object tracking and temporal modelling to predict the behaviour 
of a flock of ducks or sheep for use in the control system of a robotic sheepdog. 

These applications are merely the tip of the iceberg. As computer power 
increases, costs come down and tracking algorithms become more robust the 
number of applications for such techniques is endless. Today we see the prolifer- 
ation of CCTV cameras throughout our environment, tomorrow a vision system 
may be connected to these cameras analysing the movements of people, vehicles 
and other moving objects in our environment if we decide that this is desirable. 

We present here a temporal modelling scheme based on modelling an object 
history which uses Gaussian mixture models to predict and classify future object 
behaviour. The object history is based on modelling a complete object history 
by a single point in an abstract ‘history space’. We show how this ‘History 
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Space Classifier’ can be incorporated into an existing object tracking scheme to 
improve speed and robustness and how the results of this may be used to classify 
object behaviour. We present an example application which is the monitoring of 
livestock to detect abnormalities. 

2 Building Spatial Models of Livestock for Tracking 

Previous work has described how multiple contour models of an object class 
may be built [8] and how the variation within these models can be separated 
into independent components [9] . These models are a variation on the Point Dis- 
tribution Model [15] in which a contour is modeled by a mean shape (described 
by a set of points) and a set of linear ‘modes of variation’. We have applied 
this scheme to modelling livestock, building multiple models of cow outlines and 
separating the variation of each model into inter-animal, front legs and rear legs 
components as shown in figure 1. 




Fig. 1. Separating Object Variation Using a Hierarchical Scheme 



These models are used in conjunction with a tracking algorithm to obtain 
the position and pose of individuals over time. The object tracker consists of 
three discrete components; i) A low level position estimator, ii) A stochastic 
stage using multiple discretised shape models and iii) A ‘local search’ algorithm 
using a single continuous model. The low level position estimator is used only 
for initialisation (and re-initialisation) and gives a rough estimate of position 
and scale. The stochastic stage uses discretised versions of the shape models 
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described previously and iteratively samples possible solutions. Probabilities of 
particular solutions are updated after each iteration in a manner conceptually 
similar to the Markov Chain Monte Carlo algorithm [5] . A predictor may be used 
to initialise these probabilities. The local search stage improves on the solution 
given by the stochastic stage by performing a local image search using the most 
appropriate continuous model. This scheme has been described in more detail 
previously [10]. 



3 Building Temporal Models of Livestock Movements 

There has been much related work in the field of temporal modelling and predic- 
tion. Initial efforts such as those by Terzopoulos and Szeliski [14] used predictive 
filters such as the Kalman filter to make a prediction of future object behaviour. 
The application of these is limited as many objects of interest exhibit non-linear 
or multiple possible behaviours. These predictors may be combined with hand 
built [11] or learned [1] models of object (in particular human) kinematics. Re- 
cent work by Wren and Pentland [16] uses multiple extended Kalman filters in 
modelling the dynamics of human motion. A complex model of human dynam- 
ics is built by hand and separate extended Kalman filters are used to model 
alternative ‘atomic behaviours’. 

An alternative approach for modelling complex object ‘behaviours’ is to 
‘learn’ a model of behaviours from typical observations. If the state of an object 
can be represented by one of a finite number of tokens, techniques such as Markov 
chains and Hidden Markov Models may be used. Wren and Pentland [16] use 
Hidden Markov Models to link together sequences of ‘atomic behaviours’. These 
model a finite history of states and the computational expense of these models 
increases with the length of this history. Davis and Bobick [3] use a ‘motion his- 
tory image’ in which pixel intensity is based on time since motion last occurred 
at this pixel. These images are compared to a library of images of typical actions 
to classify unknown actions. Johnson et. al. [7] represent a complete object his- 
tory as a point in a high dimensional space. This is derived from the distances 
between a continuous object state vector and a set of prototypical state vectors. 
A maximum time decay limit operator is used to incorporate a history of states. 
This space is quantised using Vector Quantisation and a Markov Chain is used 
to make predictions about the likelihood of future histories. A CONDENSA- 
TION framework [6] is required to make predictions about future object states 
as it is not possible to reconstruct a set of states from a point in the history 
space. Sumpter and Bulpitt [13] also represent a history by a point in a high 
dimensional space, however this is works on a winner takes all approach using 
a spatial classifier based on a vector quantising (competitive learning) neural 
network. A set of leaky integrators are used to integrate pulses from the winning 
spatial classifier outputs. A second neural network is used to classify the ‘history 
space’ produced by the leaky integrators into possible next states and a feedback 
loop is used to allow the output to correct for erroneous spatial classification. 
This scheme relies on sets of histories resulting in particular next states being 
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separable using a single vector quantisation prototype for each next state. This 
will not be the case for all data sets. This scheme is also not very computation- 
ally efficient for data sets with a large number of states as the computational 
expense of evaluating the neural networks is related to the number of states. Fig- 
ures 2 and 3 illustrate the differences between the two ‘history space’ schemes 
of Johnson and Sumpter. 



Prediction from Last Time Step 




Fig. 2. The Sumpter and Bulpitt Prediction Scheme 




Fig. 3. The Johnson Prediction Scheme 



In our spatial scheme we have multiple models and thus a measure of distance 
between an object state and a prototypical object state is only available when 
the same model is used for both. This precludes the formation of a history space 
such as that used in the Johnson scheme. A similar method to the Sumpter and 
Bulpitt scheme could be applied to our data however with the high dimension- 
ality of our data this scheme would be slow. Also, as can be seen in figure 6, the 
state transitions in our data are complex and a simple vector quantisation of 
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a history space (as with the context networks) would not be appropriate We 
have developed an alternative ‘History Space Classification’ (HSC) scheme based 
on the Sumpter idea of ‘winner takes all’ however reducing the dimensionality 
of the history space using Principal Components Analysis (PCA) for improved 
prediction speed, and using Gaussian mixture models instead of neural networks 
to perform the temporal classification to deal with our complex history space. 
The scheme is detailed in figure 4. 




Spacial 

Prototypes 

(Over Multiple Models) 



Gaussian 
Mixture Model 
Evaluation 



Fig. 4. Prediction scheme used in our system 



It should be noted that there are two differences between our scheme and 
the Sumpter and Bulpitt scheme in addition to those mentioned previously. The 
first is the fact that there is no feedback loop in this scheme. In our scheme 
the low dimensionality mixture models have the power to generalise and thus 
give a sensible prediction for unseen histories (this is demonstrated in section 
4). There is therefore no need to force the history into a previously observed 
state using feedback. In addition our spatial classification comes from an object 
tracker, rather than a neural network, and we must assume that this has some 
degree of correctness otherwise performing a prediction from this is not a sensible 
course of action. The second additional difference is the use of the reciprocal of 
the time since a state last occurred as the history space representation rather 
than a representation derived from a set of leaky integrators. This gives a precise, 
easily calculable (fast), approximately frame rate independent, two way mapping 
between a sequence of states and a point in the history space. No time constant 
needs to be determined for this scheme as in the Sumpter and Bulpitt method. 
No evaluation of different history schemes has been carried out as the scheme 
used is sufficient for the data sets we are working with. It is hypothesised that 
different representation schemes will work well for different data sets depending 
on the length and nature of the history required to make a prediction. 

^ Note: The Johnson scheme produces a much simpler space as there are no discon- 
tinuities caused by using the winner takes all approach. This space is suitable for 
vector quantisation. 
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3.1 Building Gaussian Mixture Models 

There are two approaches that may be taken to build a Gaussian mixture model 
of a probability density function from a training set of multivariate data points 
(histories in our example). The first is the ‘kernel method’ or ‘adaptive kernel 
method’ in which each data point in the set of training examples is modeled 
by a single Gaussian. This is described well by Silverman [12]. This results in 
a model with a large number of Gaussian mixtures which can take considerable 
computational expense to evaluate. An alternative approach is to build a mixture 
model with fewer Gaussian mixtures than training data points which may be 
done using the Expectation Maximisation (EM) algorithm [4]. This is a two 
stage iterative scheme which computes the local maximum likelihood fit of an 
arbitrary number of Gaussians to a data set. This scheme is heavily dependent 
on how the Gaussians are initialised (this is not covered by the EM algorithm) 
and can lead to an over specific mixture model (see figure 6) . Another drawback 
of this scheme is that, if the training data is locally sparse, mixtures may become 
singularities. Gootes and Taylor [2] present a scheme which combines the kernel 
method with the EM algorithm by altering the M step as in the table in figure 5. 
In this scheme singularities never occur and the result is less dependent on the 
initialisation of the mixture model. 



Standard EM Algorithm 


Cootes and Taylor Modified EM 
Algorithm 


E-step: Compute the contribution of 
the ith sample to the jth Gaussian 


E-step: Compute the contribution of 
the ith sample to the jth Gaussian 


M-step: Compute the parameters of 
the Gaussians 

'^3 — N 


M-step: Compute the parameters of 
the Gaussians 

'^3 — N Z^i—lP'^3 5 f^3 — 


l^i=l 





Fig. 5. EM Algorithms to fit a mixture of m Gaussians to N samples x^. is 
the kernel covariance of a sample calculated using the adaptive kernel method 



We use the Gootes and Taylor modification of the EM algorithm to build 
our mixture models. In our scheme we initialise the mixture means uniformly 
across the range of the data removing mixtures for which the mean is the nearest 
mixture mean to no data point in the training set. Initial mixture covariances 
are calculated from the covariance of the training set, the number of Gaussians 
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Fig. 6. Building Gaussian Mixture Models: (a) Raw Data (b) Standard EM 
Algorithm (c) Cootes and Taylor Modified EM Algorithm 



and the dimensionality as in the kernel method [12]. The initialisation is such 
that more mixtures than required are selected. This results in mixtures becom- 
ing co-incident (having equal mean and covariance) during the course of the EM 
iterations. Under such circumstances we modify the scheme to merge these mix- 
tures after the M stage of the iteration. This results in a mixture model with an 
identical density description which is computationally more efficient to evaluate. 

3.2 Comparing Mixture Model Densities 

The scheme described previously produces a set of Gaussian mixture models 
which represent probability density functions (PDFs) within the history space. 
To make a prediction from this set of models that a newly observed history will 
result in a specific next state we must determine the relative probability that 
the history comes from each model distribution. Statistically the probability of a 
continuous distribution resulting in a particular sample value is zero. To obtain 
a non-zero probability (P) we must integrate the PDF {cj>) over some area local 
to the sample. Over a suitably small local area this is approximately equal to 
the product of the PDF magnitude at the sample and this area {6Ai) as in 
equation 1. 



P{x) = / (j){x)Sa « (f>{x) X 6Ai (1) 

SAi 

To the authors knowledge there is no statistically correct method to deter- 
mine size of the local area (<5A/) over which to integrate as this will depend on 
the range of, and magnitude of variation within the PDF. It can be shown how- 
ever that it is valid to use the same area if the probability that a sample comes 
from either of two distributions is equal when considering the complete range of 
the data (proof is omitted for brevity) . In our example there is significant vari- 
ation in the range of the model PDFs and as such it would be incorrect to use 
the same value of 5Ai for all models. As no scheme is available for the selection 
of SAi we take the approach of calculating uniform distributions with roughly 
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similar ranges to the mixture model PDFs Examples of this can be seen 

in figure 7. 




Fig. 7. Comparing PDFs with uniform PDFs with the same Range 



It is hypothesised that these distributions are similar enough in range and 
variation to use the same value of Ai for each pair of distributions. We can thus 
calculate the relative probability {PRei(Uniform)ix)) that a sample results from 
a mixture model distribution or the corresponding uniform distribution as in 
equation 2. 



p /-N X SAi (j){x) 

ReliUmform) {(j){x) X 5 Ai) + {(j)u{x) X 5Ai) (j){x) + (j)u{x) 

The relative probabilities obtained in equation 2 can be compared across 
mixture models and as such relative probabilities of a history resulting in a 
particular next state may be calculated as in equation 3. 



PRel{Xi) = 



Pi 



Rel{Uniform 



)ix^) 



N 



(3) 



PRel{Unifor7n){Xn) 

n—1 



This method of comparison gives significantly more robust prediction for seen 
and unseen data than simple comparison of PDF heights (for our data sets) . 



4 Evaluation of Predictor Scheme 

There are many metrics and tests that may be used to evaluate a predictor 
scheme, however these may be broadly divided into two types; success measures 
and failure measures. Success measures (such as average error) measure how well 
a predictor predicts future states for unseen data and failure measures (such as 
the proportion of erroneous predictions) measure the robustness of the predic- 
tion scheme. It is important to evaluate a predictor scheme using both success 
measures and failure measures to obtain an accurate description of its properties. 
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In our scheme there are multiple spatial models and thus a simple error mea- 
sure is not always available, however we can look at the proportion of correct 
model predictions and the errors within the set of correct predictions. An alter- 
native approach is to evaluate the scheme as a finite state predictor, looking at 
prediction probabilities for newly observed states. In our application multiple 
next states are possible from a given history and as such a unit (100% Probabil- 
ity) prediction for the next observed state is not what is desired. The predictor 
should assign a ‘significant probability’ to any next state that has a ‘reasonable 
probability’ of actually occurring. It is difficult to determine what a ‘significant 
probability’ or a ‘reasonable probability’ should be as this depends on the nature 
of that particular prediction. 

In this evaluation we state that any new state observed has a ‘reasonable 
probability’ of being observed (as it has been observed) and any probability 
greater than the probability given by a uniform predictor is a ‘significant prob- 
ability’. From this we define a measure ‘Probability Drop Out’ which is the 
proportion of unseen future states that are assigned a probability of less than 
that given by a uniform predictor. As a measure of success we also examine the 
mean and standard deviation of predicted probabilities for each next state ob- 
served. These are broken down by the next state observed as these predictions 
should have similar context (average results are shown). All results shown are 
from leave 2 out tests from a set of 12 sequences of healthy walking cows (ap- 
prox. 1000 frames). The results for first and second order Markov chains are also 
presented for comparison. 

The results shown in figure 8 show that the HSC scheme assigns a reasonable 
probability to actual next states observed for unseen data on average, although 
this is lower than the average for the Markov chain results. Markov chains how- 
ever have higher drop out rates and standard deviations. Examining the indi- 
vidual results we see that the high average prediction of Markov chains is due to 
unrealistically high probability predictions for certain frames averaged with very 
low probability predictions in others. This makes the Markov chain unsuitable 
for our application which requires a robust prediction mechanism. These results 
are discussed further in section 5. 

To evaluate the predictor as a behaviour classifier we used sequences of limp- 
ing and normally walking humans as insufficient data of lame cows was available. 
A single two space (inter and intra-person) spatial model was built of human 
outlines and the spaces quantised exactly as in the three space models of cows. 
Eight sequences were taken of five individuals (4 Normal and 4 Limping) and 
a set of leave one sequence (of each person) out and leave one person out tests 
were conducted. It was found that simply comparing the outputs of the two 
predictors biased the results towards the lame predictor as for approximately 
half the walking cycle limping is identical to normal walking motion and thus 
the lame predictors fit well to healthy data. The solution to this was to discard 
frames where both predictors assigned ‘significant probability’ to the observed 
next state and perform the classification on the remaining frames comparing the 
mixture model outputs as in section 3.2. 
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Rear Legs Predicturs: Drop Out Rate 




Front Legs Predictors: Average Prediction 




Rear Legs Predictors: Average Prediction 




Front Legs Predictors: Mean Standard Deviations 




Rear Legs Predicturs: Mean Standard Deviation 




Fig. 8. Prediction Results for Cow Data 
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Fig. 9. Classification Results for Human Data (2D HSC Predictor) 



In applications involving tracking it is important that a predictor can be eval- 
uated faster than the desired frame rate for the tracking algorithm. Approximate 
average evaluation rates are given in figure 10, using an entry level workstation 
(SGI R5000 180MHz 96MB). It should be noted that only one set of timing 
results is presented for our history space classifier scheme as the timings for the 
two space and single space versions are similar. 
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* 2 Space 2nd Order Markov Chain Evaluation takes several miniutes due to memory usage. 



Fig. 10. Evaluation Rates for Predictor Schemes 
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5 Discussion 

The Effect of Dimensionality Reduction: The Principal Components Anal- 
ysis stage of the predictor (see figure 4) reduces the dimensionality of the history 
space in order to increase the speed of the predictor. The results in figure 8 show 
that history space classifier (HSC) predictors with a high dimensionality reduc- 
tion (70 to 2,3 or 4 in the two space case) can give usable predictions. The 
effect of dimensionality reduction is twofold; decreased dimensionality gives a 
lower average prediction however at lower dimensionalities drop out rates are 
generally smaller (except for the single space ID case). We can conclude that 
increasing the dimensionality makes the predictor more specific and less general. 
It does however appear there may be a limit to this and there is a dimension- 
ality above which adding extra dimensions will have no effect. More results are 
required to verify this hypothesis. We must trade off generality, specificity and 
speed when choosing the dimensionality. In the example presented (cows) a 2 or 
3 dimensional classifier would seem suitable. 

Two Space predictors vs Single Space Predictors: It was hypothesised 
that making predictions about future front leg movements would be more robust 
if information about rear leg histories was used in addition to information about 
front leg histories (and vice versa). This was achieved by concatenating the front 
and rear legs history spaces before the Principal Components Analysis part of 
the predictor. No firm conclusions may be drawn on the relative merits of the 
single space and this two space scheme however the results suggest a single 
space predictor may be more suitable for the front legs predictors and a two 
space predictor may be more suitable for the rear legs predictor (this is based on 
drop out rates and standard deviations). This is a sensible result as the variation 
in position of the front legs is much larger (as they bend more) and thus there 
is more information about the position in the walking cycle in the front legs 
positions than the in the rear legs positions. We can conclude that the merits of 
the two schemes are application specific. 

Use of Predictor as a Classifier: The results in figure 8 show that our History 
Space Classifier scheme has improved generality over Markov chain schemes 
which is seen in lower drop out rates and lower prediction standard deviations, 
however specificity is also required in order to make a behaviour classification. 
The results in figure 9 suggest that, even for the low dimensionality HSC used, 
enough specificity has been preserved in the predictor to make a reasonable 
classification even for unseen individuals in this relatively small test set. 

6 Conclusions 

We have presented a spatio-temporal modeling scheme using a hierarchical spa- 
tial model and a novel ‘History Space Classifier’ temporal prediction scheme. 
This scheme can be used to track objects and classify their movements into ‘be- 
haviours’. As an example the detection of lameness in cows is presented. The 
scheme has also been evaluated on sequences of limping humans. The prediction 
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or classification can be performed faster than current video frame rates on a 
relatively inexpensive computer and has a relatively small memory requirement 
which means it may be incorporated into a ‘real time’ behaviour evaluation 
system. 
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Recognition of Articulated Objects 
in SAR Images 
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Abstract. This paper presents an approach for recognizing articulated 
vehicles in Synthetic Aperture Radar (SAR) images based on invari- 
ant properties of the objects. Using SAR scattering center locations and 
magnitudes as features, the invariance of these features with articulation 
(e.g. turret rotation of a tank) is shown for SAR signatures of actual 
vehicles from the MSTAR (Public) data. Although related to geomet- 
ric hashing, our recognition approach is specifically designed for SAR, 
taking into account the great azimuthal variation and moderate artic- 
ulation invariance of SAR signatures. We present a recognition system, 
using scatterer locations and magnitudes, that achieves excellent results 
with the real SAR targets in the MSTAR data. The articulation invari- 
ant properties of the objects are used to characterize recognition system 
performance in terms of probability of correct identification as a func- 
tion of percent invariance with articulation. Results are also presented 
for occluded articulated objects. 



1 Introduction 

In this paper we are concerned with recognizing articulated vehicles, starting 
with SAR image chips of various target vehicles from the MSTAR (Public) tar- 
get data set [9] and ending with identification of the specific vehicle type (e.g., 
a T72 tank). The major challenge is that the vehicles can be in articulated con- 
figurations such as a tank with its turret rotated. In addition, the articulated 
vehicles can be partially hidden (occluded). Previous work in recognizing artic- 
ulated objects in optical images [1] [4], has used simple models (like scissors and 
pantograph lamps) and has used constraints around a joint to recognize these 
objects. Because of the unique characteristics of SAR image formation (specular 
reflection, multiple bounces, low resolution and non-literal nature of the sensor), 
it is difficult to extract linear features (commonly used in visible images), espe- 
cially in SAR images at one foot resolution. Previous recognition methods for 
SAR imagery using templates [10] [8] are not well suited for the recognition of 
articulated objects, because each different articulation configuration requires a 
different template leading to a combinatorial explosion. 

We approach the problem of recognizing articulated objects from the funda- 
mentals of SAR images. We characterize the SAR image azimuthal variance to 
determine the number of models required. We demonstrate (and measure) the 
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SAR scattering center location and magnitude invariance with target articula- 
tion. Based on these invariants, we develop SAR specific recognition methods 
that use non-articulated models to successfully recognize articulated versions of 
the same objects. The key contributions of the paper are the development and 
evaluation of a method, based on location and magnitude invariant properties 
of scattering centers, to recognize articulated objects in one foot resolution SAR 
data. 

2 SAR Target Image Characteristics 

The typical detailed edge and straight line features of man-made objects in the 
visual world, do not have good counterparts in SAR images for sub-components 
of vehicle sized objects at one foot resolution. However, there are a wealth of 
peaks corresponding to scattering centers. The relative locations of SAR scatter- 
ing centers, determined from local peaks in the radar return, are related to the 
aspect and physical geometry of the object, independent of translation and serve 
as distinguishing features. In addition to the scatterer locations, the magnitudes 
of the peaks are also features that we use in this paper. 

Photo images of the MSTAR articulated objects, a T72 tank serial number 
(#) a64 and a ZSU 23/4 antiaircraft gun #d08 are shown in Fig. 1. Example 
SAR images and the regions of interest (ROI), with the locations of the scat- 
tering centers superimposed, are shown in Fig. 2 for baseline and articulated 
versions of the T72 and ZSU. The MSTAR articulated object data is all at one 
foot resolution and 30° depression angle. The ROIs are found in the MSTAR 
SAR object chips by reducing speckle noise using the Crimmins algorithm in 
Khoros [6], thresholding at the mean plus two standard deviations, dilating to 
fill small gaps among regions, eroding to have one large ROI and little regions, 
discarding the small regions with a size filter and dilating to expand the extracted 
ROI. The parameters used in extracting ROIs are held constant for all the re- 
sults reported. The scattering centers are extracted from the SAR magnitude 
data (within the boundary contour of the ROI) by finding local eight-neighbor 
maxima. 



2.1 Azimuthal Variance 

The typical rigid body rotational transformations for viewing objects in the vi- 
sual world do not apply much for the specular radar reflections of SAR images. 
This is because a significant number of features do not typically persist over a 
few degrees of rotation. Since the radar depression angle is generally known, the 
significant unknown object rotation is (360°) in azimuth. Azimuth persistence 
or invariance can be expressed in terms of the average percentage of scattering 
center locations that are unchanged over a certain span of azimuth angles (when 
we compare scatterer locations in the ground plane of an image rotated by some 
azimuth increment with another image at the resulting azimuth angle). Since 
the objects in the MSTAR chips are not registered, we calculate the azimuth 
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(a) T72 image (b) T72 ROI 



(c) articulated 
T72 image 



(d) articulated 
T72 ROI 



(e) ZSU image (f) ZSU ROI 



(g) articulated 
ZSU image 



(h) articulated 
ZSU ROI 



(a) T72 (b) ZSU 23/4 

Fig. 1. Articulated T72 tank and ZSU 23/4 antiaircraft gun photos 



Fig. 2. SAR images and ROIs (with peaks) for T72 tank and ZSU 23/4 at 66' 
azimuth 
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invariance as the maximum number of corresponding scattering centers (whose 
locations match within a given tolerance) for the optimum integer pixel trans- 
lation. This method of registration by finding the translation that yields the 
maximum number of corresondences has the limitation that for very small or no 
actual invariance it may find some false correspondences and report a slightly 
higher invariance than in fact exists. Figure 3 shows an example of the mean 
scatterer location invariance (for the 40 strongest scatterers) as a function of az- 
imuth angle span using T72 tank #132, with various definitions of persistence. 
The cases labeled ‘persists’ in Figure 3 enforce the constraint that the scatterer 
exist for the entire span of angles and very few scatterers continuously persist for 
even 5°. Because of this azimuthal variation we model the objects at 1° incre- 
ments. In the two cases not labeled ‘persists’ in Figure 3, scintillation is allowed 
and the location invariance declines slowly with azimuth span. The ‘within 1 
pixel’ results (that allow scintillation) are consistent with the one foot ISAR 
results of Dudgeon [2], whose definition of persistence allowed scintillation. 
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Fig. 3. Scatterer location persistence for T72 #132 



within 1 pixei 
exact match 
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2.2 Scatterer Location Invariance with Articulation 

The existance of scattering center locations that are invariant with object ar- 
ticulation is crucial to successfully recognizing articulated objects with non- 
articulated models, thus avoiding the combinatorial problem of modeling 360 
articulations x 360 azimuths. We express the scattering center location invari- 
ance with respect to articulation as the maximum number of corresponding 
scattering centers (whose locations match within a stated tolerance) for the op- 
timum integer pixel translation. Given an original version of a SAR object image 
with n scattering centers, represented by points at pixel locations Pi = {xi,yi) 
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for 1 < i < n and a translated, distorted version Pj = (1 < j < n) at a 

translation t = (tx,ty), we define a match between points Pj and Pi as: 

{ 1 if \x'j —tx — Xi\ < I and 
\yj-ty-y,\ < I 
0 otherwise 

where I = 0 for an ‘exact’ match and I = 1 for a match ‘within one pixel’. The 
scatterer location invariance, P„, of n scatterers, expressed as a percentage of 
matching points, is given by: 



Ln = max 

t 



100 

n 



n 

^min 

i=i 






Mij (t) 




where each point Pj is restricted to at most one match. Figure 4 shows the 
location invariance, P 40 , of the strongest 40 scattering centers with articulation 
for the T72 tank and ZSU 23/4 antiaircraft gun as a function of the hull azimuth. 
The average articulation invariance is 16.5% for the exact match cases and 56.5% 
for the within one pixel cases. 
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Fig. 4. Scatterer location invariance with articulation 



2.3 Scatterer Magnitude Invariance 

Using a scaled scatterer amplitude (S), expressed as a radar cross section in 
square meters, given by S = 100 + 101ogiQ(*^ + g^), where i and q are the 
components of the complex radar return, we define a percent amplitude change 
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{Ajk) as: Ajk = 100(5'^ — Sk)/Sj. (This form allows a larger variation for the 
stronger signal returns.) A location and magnitude match Qjk(t) is given by: 



Q jk (t) 



1 if Mjk{t) = 1 and \Ajk \ < I a 
0 otherwise 



where I a is the percent amplitude change tolerance. The scatterer magnitude 
and location invariance (In), expressed as a percentage of n scatterers, is given 
by: 

Figure 5 shows the probability mass functions (PMFs) for percent amplitude 
change for the strongest 40 articulated vs. non-articulated scattering centers of 
MSTAR T72 tank #a64 and ZSU #d08. Curves are shown both for the cases 
where the scattering center locations correspond within a one pixel tolerance 
and for all the combinations of scatterers whose locations do not match. 
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Fig. 5. Scatterer magnitude invariance with articulation 



3 SAR Recognition System 

Establishing an appropriate local coordinate reference frame is critical to reliably 
identifying objects (based on locations of features) in SAR images of articulated 
objects. In the geometry of a SAR sensor the ‘squint angle’, the angle between 
the flight path (cross-range direction) and the radar beam (range direction), 
can be known and fixed at 90°. Given the SAR squint angle, the image range 
and cross-range directions are known and any local reference point chosen, such 



102 



G. Jones III and B. Bhanu 



as a scattering center location, establishes a reference coordinate system. The 
relative distance and direction of the other scattering centers can be expressed in 
radar range and cross-range coordinates, and naturally tessellated into integer 
buckets that correspond to the radar range/cross-range bins. The recognition 
system takes advantage of this natural system for SAR, where a single basis 
point performs the translational transformation and fixes the coordinate system 
to a ‘local’ origin. 

Our model-based recognition system uses standard non-articulated models 
of the objects (at 1° azimuth increments) to recognize the same objects in non- 
standard articulated configurations. Using a technique like geometric hashing [7], 
the relative positions of the scattering centers in the range (R) and cross-range 
(C) directions are indices to a look-up table of labels that give the associated 
object type and pose. This is an efficient search for positive evidence that gener- 
ates votes for the appropriate object (and azimuth). The models and recognition 
engine have evolved from the previous 2D version [5] (which was applied to syn- 
thetic SAR data), which uses only the relative distances and the ‘exact’ scatterer 
locations, to a 6D version for the more challenging real MSTAR data, which uses 
more local features and accommodates a ‘within 1 pixel’ scatterer location uncer- 
tainty. In the 6D version the model look-up table labels contain four additional 
features: range and cross-range position of the ‘origin’ and the magnitudes (S) 
of the two scatterers. The model construction algorithm for the 6D recognition 
system is outlined in Fig. 6. 



1. For each model Object do 2 

2. For each model Azimuth do 3, 4, 5 

3. Obtain the location {R, C) and magnitude (S) of the strongest n scatterers. 

4. Order {R, C, S) triples by descending S. 

5. For each origin O from 1 to n do 6 

6. For each point P from 0-|-l to n do 7, 8 

7. dR = Rp — Ro\ dC = Cp — Co- 

8. At look-up table location dR, dC append to list entry with: Object, 
Azimuth, Ro, Co, So, Sp. 

Fig. 6. 6D model construction algorithm 



For ideal data one could use the strongest scatterer as the origin, however any 
given scatterer could actually be spurious or missing due to the effects of noise, 
articulation, occlusion, or non-standard configurations. Thus, we model and use 
(for recognition) all the scattering center locations in turn as the origin, so the 
size of the look-up table models and the number of nominal relative distances 
considered in the recognition of a test image is n(n — 1) /2, where n is the number 
of the strongest scattering centers used. 

In contrast to many model-based approaches to recognition [3], we are not 
‘searching’ all the models; instead we are doing table look-ups based on relative 
distances between the strongest scatterers in the test image. Each query of the 
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look-up table may generate votes for one or more potential candidate solutions. 
Comparison of each test data pair of scatterers with the model look-up table 
result (s) provides information on the range and cross-range translation and the 
percent magnitude changes for the two scatterers. Limits on allowable values for 
translations and magnitude changes are used as constraints to reduce the number 
of false matches. (The number of scattering centers used and the various con- 
straint limits are design parameters that are optimized, based on experiments, to 
produce the best recognition results.) Here, votes are accumulated in a 4D space: 
object, azimuth, range and cross-range translation. Also a (city-block) weighted 
voting method is used to reduce the impact of the more common small relative 
distances. To accommodate some uncertainty in the scattering center locations, 
the eight-neighbors of the nominal range and cross-range relative location are 
also probed in the look-up table and the final translation results are summed 
over a 3x3 neighborhood in the translation subspace. This voting in translation 
space, in effect, converts the consideration of scatterer pairs back into a group 
of scatterers at a consistent translation. The process is repeated with different 
scattering centers as reference points, providing multiple ‘looks’ at the model 
database to handle spurious scatterers that arise due to articulation, noise or 
other factors. To handle identification with ‘unknown’ objects, we introduce a 
criteria for the quality of the recognition result (e.g., the votes for the potential 
winning object exceed some threshold Vmin)- The recognition algorithm is given 
in Fig. 7. 



1. Obtain from test image the location (R,C) and magnitude (S') of n strongest 
scatterers. 

2. Order {R, C, S) triples by descending S. 

3. For each origin O from 1 to n do 4 

4. For each point P from O-fl to n do 5, 6 

5. dR = Rp — Ro\ dC = Cp — Co- 

6. For DR from dR-1 to dR+1 do 7 

7. For DC from dC-1 to dC+1 do 8, 9, 10 

8. weighted_vote = |DR| + |DC|. 

9. Look up list of model entries at DR, DC. 

10. For each model entry E in the list do 11 

11. IF |tr = Ro — Re\ < translationjimit and |tc = Co — Ce\ < translationjimit 
and |1 — Seo/So\ < magnitude-limit and |1 — Sep/Sp\ < magnitude-limit 
THEN increment accumulator array [Object, Azimuth, tr, tc] by weighted-vote. 

12. Query accumulator array for each Object, Azimuth, tr and tc, summing the 
votes in a 3x3 neighborhood in translation subspace about tr, tc; record the 
maximum vote-Sum and the corresponding Object. 

13. IF maximum vote-Sum > threshold 

THEN result is Object ELSE result is “unknown” . 



Fig. 7. 6D recognition algorithm 
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4 Recognition Results 

In these experiments the models are non-articulated versions of T72 #a64 and 
ZSU23/4 ^d08 and the test data are the articulated versions of these same 
serial number objects (with BRDM2 #e71 as an “unknown” confuser vehicle, 
when required). Recognition results are optimum using 38 scattering centers, a 
translation limit of ±5 pixels and a percent magnitude change of less than ±9%. 
Without the confuser, the overall forced recognition rate is 100% over a range 
from 14 to 40 scattering centers. In order to handle unknown objects we introduce 
a minimum vote threshold criteria for a target declaration; objects which fail 
to achieve the vote threshold are labeled as “unknown”. This can be seen in 
Table 1, an example confusion matrix for a vote threshold of 2100 votes, where 
the T72s are all correctly classified and only 2 ZSUs are misclassified as unknown 
for a 0.990 probability of correct identification (PCI), while 32 BRDM2s are 
misidentified as targets for a 0.126 probability of false alarm (PFA). 

A form of Receiver Operating Characteristic (ROC) curve, with PCI vs. 
PFA, can be generated by varying the vote threshold (e.g. from 1500 to 4000 in 
50 vote increments). The ROC curve in Fig. 8 shows the excellent articulated 
object recognition results that are obtained with this system. Figure 9 shows how 
the PCI varies with the percent articulation invariance (for a within one pixel 
location match) . The sets of curves are shown with different vote thresholds from 
1700 to 2700 to generate failures that illustrate the effect of location invariance 
on recognition rate. 



Table 1. Articulated object confusion matrix 



MSTAR (Public) 
articulated 
test objects 


Identification 

results 


T72 


ZSU 


unknown 


T72 315” turret 


98 


0 


0 


ZSU 315° turret 


0 


92 


2 


BRDM2 


32 


0 


222 



There is no real SAR data with occluded objects available to the general 
public. In addition, there is no standard, accepted method for characterizing or 
simulating occluded targets. Typically occlusion occurs when a tank backs up 
into a tree line, for example, so that the back end is covered by trees and only 
the front portion of the tank is visible to the radar. Thus, the ‘bright target’ 
becomes a much smaller sized object. However, the tree tops can produce ad- 
ditional ‘bright’ peaks that can be of similar strength to target peaks at many 
azimuths. The occluded test data in this paper is simulated by starting with 
a given number of the strongest scattering centers and then removing the ap- 
propriate number of scattering centers encountered in order, starting in one of 
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Probability of false alarm 



Fig. 8. Receiver Operating Characteristics 




Fig. 9. Recognition rate and articulation invariance 
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four perpendicular directions di (where di and ds are the cross range directions, 
along and opposite the flight path respectively, and ^2 and ^4 are the up range 
and down range directions). Then the same number of scattering centers (with 
random magnitudes) are added back at random locations within the original 
bounding box of the target. This keeps the number of scatterers constant and 
acts as a surrogate for some potential occluding object. 

In the occluded articulated object experiments the models are non-articulated 
versions of T72 #a64 and ZSU23/4 #d08 and the test data are the articulated 
versions of these same serial number objects that are occluded in the manner 
described above. Figure 10 shows the effect of occlusion on recognition of these 
MSTAR articulated objects for various numbers of scattering centers used. 
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Fig. 10. Effect of occlusion on articulated object recognition 



5 Conclusions 

The large azimuthal variations in the SAR signatures of objects can be success- 
fully captured by using models at 1 ° degree increments for a fixed depression 
angle. Useful articulation invariant features exist in SAR images of vehicles. In 
addition, where the scattering center locations are invariant with articulation, 
the corresponding magnitudes are also invariant within a small tolerance (typi- 
cally less than a 10% change). The feasibility of a unique concept for a system to 
recognize articulated objects in SAR images based on non-articulated models is 
demonstrated. The great variability of the MSTAR SAR data can be successfully 
overcome by using the scatterer magnitude as well as location, by accommodat- 
ing one pixel uncertainty in the scatterer location and by considering an object 
as a group of scatterers at a consistent translation. 
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Abstract. In this paper a robust method for bloek-based motion analysis 
in monoeular image sequenees is eonsidered. Due to the realized 
reeognition of false measurements by a neural reeognition system, gross 
errors in the motion trajeetories are avoided. Assuming eorrespondenees 
between regions in sueeessive images, a model based reeursive 
estimation teehnique is applied to estimate the motion of the observed 
image region. For this purpose a Kalman filter is used. The underlying 
kinematie model eontains assumptions about the motion, espeeially 
eonstant veloeity eomponents are assumed. Partieularly, the existenee of 
problematie image situations (e.g. partial oeelusion of objeets) leads to 
gross errors in the measuring values and false motion parameters are 
estimated. In order to eope with this problem an extension of the 
filtering algorithm by a neural reeognition system is proposed. This 
system reeognizes typieal problematie image situations and eontrols the 
adaptation of the filter. Seleeted results for real-world image sequenees 
are deseribed. 



1 Introduction 

Motion estimation teehniques represent a foeal point in the field of image sequenee 
proeessing. These teehniques try to interpret the ehanges in sueeessive images as 
eaused exelusively by motion of the objeets eontained in the seene. The analysis of 
image sequenees is often done in a reeursive manner. Reeursive methods use the 
redundaney of motion information in temporal direetion for a more robust and 
effieient motion analysis. The starting point is usually the measurement of 
displaeement and rotation using eorrespondenees in sueeessive images. These 
eorrespondenees ean be generated using pixel-based matehing algorithms [1]. In 
eontrast to approaehes determining eorrespondenees of eertain image features, sueh as 
line segments [2] or flexible eontour models of objeets [3], in this paper the gray- 

H. H. Nagel and F. J. Perales (Eds.): AMDO 2000, LNCS 1899, pp. 108-119, 2000. 
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levels of image blocks are used immediately without any feature extraction. Therefore, 
our method is not restricted to certain specific object features. Using a set of image 
blocks it is possible to analyze articulated motion or deformable objects too. 



2 Pixel-Based Matching Algorithm 



For the measurement of image motion a pixel-based matching algorithm is used. In 
modification to the conventional block matching algorithm [1] the determination of 
displacement and rotation with sub-pixel accuracy is possible. The matching 
algorithm realizes a comparision of image regions in successive images by calculating 
a similarity criterion. In this approach quadratic image blocks are compared (see 
Fig. 1). At first the translational motion and afterwards the rotational motion is 
determined. In order to generate the compared image windows, the sequence images 
are interpolated at the position of the sample points p‘. These points represent an 
equidistant quadratic grid, that can be located at any position and angle in the image. 
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image region 
in 

image 




image region 
in 

image 



MATCHING 



translation 





V 












r 



rotation 

7 ^ 



Fig. 1. Matching of image regions in successive images 



The correspondences of image blocks in successive images are obtained using the 
following matching criterion: 

MAD(aj) = ^^^X|B,(pi)-B,„(p;„)| ; pU, = /(pi 

translation: 

rotation: m,. = cOj^ ; < a)/. < 

imaged ; p'^ .../- th sample point in B^, ; /.. .motion function 

^...number of sample points in the window 

The function specified in Eq. 1 represents a simplified block-based correlation 
measure for the determination of translation (MAD.^) and rotation (MAD^). These 
functions are generated by the comparison of a reference window in image B^ with 
all possible search windows located within a search area in the next image B^^j . Since 
they calculate the Mean Absolute Difference between two image windows, they are 
called MAD-functions. The coordinates of the minimum of the MAD ^-criterion 

represent the measured velocity [v^ and the position of the MAD ^-minimum 

represents the measured angular velocity ® of the reference window. These values 
refer to the position respectively the angle at time k and make it possible, to measure 
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the midpoint position [p^ the angle of the considered window in the 

next image. The index M indicates that these parameters represent measuring values. 

In order to derive highly resolved measuring values from the discrete MAD- 
functions, an approximation of the similarity criteria around the minimum is realized. 
For this purpose the discrete minimum value of the MAD^ - and the MAD^ -criterion 
and their 2 respectively 4 neighboring values are approximated by the following 
model functions MAD"’"'* . 

model function 1: MAD““‘’(v^,v,,) = +Cj^v^^ (2) 

model function 2: MAD'™‘* (<y) = -H c™ 

After determining the coefficients c*‘* the minimum position of each of the model 
functions is calculated using the first derivations of the MAD“°‘' -functions. 

Own experiments on real-world images have shown, that measuring values 
calculated by this procedure are corrupted by noise. Depending on the required 
accuracy the influence of noise can be more or less disturbing. Particularly, if there are 
applications with high requirements regarding the quality of motion parameters, it is 
necessary to minimize the influence of noise. In the next paragraph a recursive 
filtering algorithm is presented which is able to reduce the influence of noise. 



3 Recursive Motion Estimation 

The procedure which reduces the influence of noise from the measuring values is 
called estimation. Considering sequences of data the estimation is often realized in a 
recursive manner. In this paper such an estimation procedure is used to process the 
motion parameters (measuring values) calculated by the matching algorithm as 
described in the section before (see scheme shown in Fig. 2). 

measuring 
A values 

Fig. 2. Motion estimation for noise reduction 

In the field of image sequence processing Kalman filtering techniques have been 
gained in their importance as a method for model-based state estimation problems 
during the past decades [4, 5]. These techniques are used ordinarily to estimate the 
motion parameters of scene objects. 

Using the Kalman filter algorithm to estimate motion parameters, they are derived 
from two parts (see Fig. 3). The fist one is represented by the measuring values 
calculated by the pixel-based matching algorithm in this case. The second part is the 
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underlying kinematic model that describes the assumptions about the motion in the 
image sequence. The filtering algorithm controls the influence of the two parts. 




Fig. 3. Estimation of motion parameters 



A set of motion parameters (for instance parameters for position and velocity) can 
be combined in the state vector S. The conventional Kalman filter, as described in [6], 
requires a linear model of the observed state vector S in the state space {system 
model). The definition of an appropriate systeru model requires assumptions about the 
observed motion. A simple kinematic model, applicable to monocular image 
sequences, assumes motion of rigid objects in a plain perpendicular to the cameras 
optical axis and is used in this paper. Furthermore, the motion in a short time interval 

is assumed to be constant and to be composed of a translational velocity v and 
an angular velocity o ) . The corresponding parameters for the position and the angle 
are and qy. 

In many cases the ruotion of an object part between 2 successive tirue steps (time 
interval At ) can be described by the following simple recursive kinematic model: 
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The uncertainty in the system model is considered by the separate stochastic noise 
vectors . Each element of w, is assumed as caused by a zero-mean white Gaussian 

process. Using a set of state vectors s respectively system models it is possible to 
describe articulated motion or defonuable objects too. 

A measuring algorithru calculates the measurement vector x from correspondences 
of image regions in successive images (see section 2). The measurement equation 
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represents the linear eonneetion between the measurement veetor x and the state 
veetor S and ean be written as follows: 



X* = F* s, + n. 



V 




1 0 0 0 0 0 


M 

P. 


; F* = 


0 1 0 0 0 0 


1 

; ^ ^ 

1 




0 0 1 0 0 0 



( 4 ) 



The values of represent the measured midpoint position and cp^ is the 

measured angle of the eonsidered image region. Since, as mentioned above, the 
measurement (matching) is always executed between two successive images, the 
current measurement vector is derived from the sum of the vector one time 
step before and the determined velocity components. 

The noise in the measuring values is assumed by the stochastic noise vector . 
The elements of this vector are assumed to have a zero-mean Gaussian probability 
density function as the elements of the vector • 

Fig. 4 shows the internal signal flow chart of the Kalman filter (see for instance 
[7]). The updated state vector s* is composed of two parts, whereof the first one is the 
vector Sj and the second one represents the difference between x^. and the estimated 
measurement s^. The Kalman gain is calculated recursively and represents a 
variable amplification matrix for this difference. The predicted state vector at the next 
time step (s^^j) is calculated using the system matrix Hj. and the updated state vector 
at time k. 




Fig. 4. Signal flow chart of the Kalman filter 

The Kalman filter yields a recursive estimate s of the state vector s which is 
optimal in the sense of a least square optimization criterion. This criterion minimizes 
the trace of the covariance matrix of the estimation error s — s . 

The stochastic properties of the noise vectors Wj and are taken into 
consideration by the covariance matrices of both vectors. Conventionally the elements 
of the covariance matrices are assumed to be constant and are determined on the basis 
of knowledge about the noise characteristic. On the whole, they influence the 
smoothing capability of the filter. 

For further details and the mathematical description of the well-known filter 
equations the reader is referred to the appropriate literature (for instance [7]). 
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Besides the capability of the filter, to reduce the influence of noise, there are some 
other important advantages. For instance the inclusion of an appropriate system model 
makes it possible, to predict the state vector one step into the future. Due to this 
capability, expectations about the motion parameters at the next time step are 
specified without calculating a measurement. They can be used effectively to reduce 
the search area in the matching algorithm. 

In the field of image processing the measuring values have a specific characteristic, 
that depends on the type of the considered image data, the image contents and some 
other important conditions during the image acquisition. Particularly, the existence of 
problematic image situations leads to gross errors (outliers) in the measuring values. 
Such situations, for instance, arise from overlapping motion and partial occlusion of 
objects or illumination changes in the scene. Fig. 5 (on the left) shows the first image 
from a real-world image sequence that contains a moving object (bus). Parts of the 
object are partially occluded by other stationary objects in the foreground (specified in 
Fig. 5) during the motion. 
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Fig. 5. Left: First image from a real-world image sequence. Problematic image situations are 
indicated by white arrows and the starting window for the motion estimation is depicted as a 
square; Right: Results of motion estimation (position and angle) for the image window using 
the conventional Kalman filter (KF) and the corresponding reference values at each time step 



This occlusion leads to deformations of the matching criteria, if the occlusion 
appears in the reference window or in the search window. Since the measuring values 
for the position and the angle of the considered window (square in the image of Fig. 5, 
on the left) are derived from the minimum position of these matching criteria, the 
stream of measurements consequently contains outliers or useless values. An example 
for such distorted matching criteria is shown in Fig. 6 (column k = 9). There are no 
significant minima and the MAD-values are higher compared to the non-distorted 
matching criterion depicted in Fig. 6 (column k = 5). 

Due to the existence of outliers in the measuring values, the above mentioned 
assumptions concerning the stochastic noise vector are often incorrect. It can be 
noticed, that outliers are not taken into consideration in the conventional Kalman filter 
algorithm. As a result, the estimated state vector s does not converge against the 
‘true’ object motion or there is a considerable delay in the filter response. 
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Fig. 6. Different MAD-criteria for selected time steps k 



Consequently, the considered image region for the sequence in Fig. 5 (on the left) 
does not contain the same part of the moving object, as in the first image after the 
appearance of occlusion. In Fig. 5 (on the right) the estimated midpoint position 




and angle ^ of the image region are depicted in the coordinates of the 



image scene. Compared with the reference values it is visible, that there is a 
permanent divergence of the motion trajectory after the occlusion. 

Obviously the conventional Kalman filter is not able to eliminate the influence of 
problematic image situations to the estimates, if there is no further adaptive 
component in the filter. 



4 System for Robust Motion Estimation 

In order to cope with the problems, arising from the problematic image situations 
described in the previous section, an extension of the conventional filtering structure is 
suggested. The idea is to evaluate the quality of the measuring values at each time step 
and to adapt certain parameters of the filter in dependence on the reliability of the 
measurement. 

The proposed system for a robust motion estimation is shown in Fig. 7. A new 
approach is that both the MADj^- and the MAD -criterion calculated by the pixel- 
based matching algorithm are used for the determination of the measurement vector 
Xj, and as input for a recognition system. This system realizes an adaptation of the 
Kalman filter depending on several features of the MAD-criteria. For this purpose the 
elements of a later defined gating matrix are adjusted. During the occurrence of 
problematic image situations it is later described, that these elements are small and the 
filter derives the motion parameters mainly from the underlying kinematic model. 
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Fig. 7. Proposed system for robust motion estimation 



Own investigations have shown that certain features of the matching criterion 
represent a measure for the reliability of the corresponding measuring value [8, 9]. In 
[9] a set of 6 simple features has been found empirically, that are suitable to 
describe the main properties of MAD-criteria. Furthermore it is necessary, that the 
used features are independent from the minimum position of the respective matching 
criterion. Essential features in this context are, for instance, the mean of the MAD- 
criterion and the value of the minimum. The features are elements of the feature vector 
f. The feature extraction is realized separately for the MAD^- and the MAD^- 
criterion (see Fig. 8). 
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Fig. 8. Quality assessment of matching criteria 



It is proposed to use a recognition system that assesses the quality of the respective 
MAD-function based on these features. The recognition system is realized by artificial 
neural networks (ANN). In this case two simple Multi-Layer Perceptons (MLP) have 
been employed (see Fig. 8). Typically they consist of several layers of neurons. The 
output of each neuron is connected with the input of all neurons in the next layer. The 
first layer (input layer) stores the input vector f . The outputs of the MLP’s are 
equivalent to the diagonal elements of the above mentioned gating matrix Gj. • 

Neural networks have the ability to learn their recognition task from examples. This 
feature is very useful, because the many different effects of problematic image 
situations on the matching criterion can be better described by typical examples (e.g. 
see Fig. 6) than by mathematical models. Another advantage of neural networks is 
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their generalization ability, that means, they are able to produce an appropriate 
response also for those input data, which are similar to the training examples, but have 
never been trained before. Therefore, the variety of possible matching criteria can be 
considerably reduced and only typical candidates have to be contained in the training 
data set. Because of limitations in this paper a detailed description of the structure and 
training of the used feed-forward networks are left out here. For details the reader is 
referred to [9]. 

Fig. 9 shows the proposed filtering structure. Compared with the conventional 
Kalman filter (see Fig. 4) the difference between the measurement vector and the 
estimated measurement is multiplied by the gating matrix , before the 

multiplication by the Kalman gain is realized. So it is possible to weight each 
element of the difference vector corresponding to the output of the respective MLP. 




Fig. 9. Signal flow chart of the proposed filtering structure 



The 3x3 gating matrix G^, is defined as follows: 



G 



k 



gV' 0 
0 gf"’ 

0 0 



0 

0 



0<gr <1 



(5) 



Due to the properties of the MLP’s the diagonal elements represent floating 
point values. This is an essential advantage compared to those approaches (for in- 
stance [10]), using only two discrete values (1 or 0) for a „good/bad“-classification. 



5 Selected Examples for the Representation of Results 

The proposed system was tested for the analysis of several real-word image 
sequences. In this section selected examples are presented. Fig. 10 shows the first 
image from a sequence that contains problematic image situations. In this case parts of 
the moving object (toy) are partially occluded by stationary objects in the foreground. 
As a result, the state vector s estimated by the conventional Kalman filter does not 
describe the ‘true’ motion of the considered image region. This error is visible in 
Fig. 10 (on the right), where the position and the angle of the respective image block 
is depicted at each time step. There is a permanent divergence of the estimated motion 
parameters after the first occlusion. 
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Fig. 10. Results of motion estimation using the proposed system (on the left) respectively the 
conventional Kalman filter (on the right) 

Due to the recognition system (see section 4) the proposed system is able to 
recognize problematic image situations and to readjust the elements of gating matrix 
Gj, depending on the degree of disturbance. That means that the incoming (false) 
measurements in the vector Xj. are weighted less than before and the Kalman filter 
uses more and more the internal system model for the state vector update. 
Consequently, the state vector s* is estimated mainly from the internal system model. 
As a result in the example shown in Fig. 10 (on the left), the correct estimation of the 
image block position and angle is achieved even in case of partial occlusion. 

The dark image blocks indicate, that the sum of the 3 diagonal elements g*' '* of the 
gating matrix is less than the threshold 1.5. This threshold is chosen only for the 
purpose of visualization. As mentioned above, the elements g*' '* represent floating 
point values (see example in Fig. 1 1). 
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Fig. 11. Diagonal elements g[*’** and determined by the recognition system for the 

upper image block of the sequence in Fig. 10 at each time step 

The proposed system was tested for both indoor sequences acquired under definite lab 
conditions and real-word outdoor sequences. An example for the latter image 
sequences is shown in Fig. 12. There are various problematic image situations (e.g. 
overlapping motion, partial occlusion, shadows on moving objects and illumination 
changes) within the scene. 
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Fig. 12. First image of the sequence TRAFFIC and position of selected image regions 

In Fig. 13 the results of motion analysis for the TRAFFIC sequence are shown. The 
proposed system is able to cope with the occuring different problematic image 
situations. The reason for this capability is the above mentioned generalization 
property of the recognition system (see section 4). 




Fig. 13. Results of motion estimation for the TRAFFIC sequence (see Fig. 12) using the 
proposed system. For the purpose of a better visualization the estimated position and angle of 
the image blocks are plotted in the vertical direction. The dark blocks indicate, that the sum of 
the 3 diagonal elements of the gating matrix Gj. is less than the threshold 1.5 
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6 Conclusions 

The described results show, that the proposed system according to Fig. 7 makes it 
possible, to achieve a considerable improvement of the motion estimation in the case 
of problematic image situations. This system contains a neural recognition system, that 
realizes the adaptation of the motion estimation algorithm. The recognition system 
uses simple features of the matching criterion in order to assess the quality of the 
incoming measurements. A further improvement of the recognition capability can be 
expected when using more complex features of the matching criteria. Intended work in 
the future will investigate a motion estimation system for the analysis of color image 
sequences. Another goal is, to use object-specific shaped image regions instead of the 
presently used fixed quadratic blocks. The proposed system shall be used in future to 
analyze articulated motion and deformable objects. 



Acknowledgements 

This work was supported by the LSA grant 1441A/8386H . 



References 

1. Musmann, FI. G.; Pirsch, P.; Grallert, H. J.: Advances in Picture Coding. Proc. of 
the IEEE, 1985, Volume 73, No. 4, 523-548. 

2. Zhang, Z.: Token Tracking in a Cluttered Scene. Research Report N°2072, INRIA, 
ISSN 0249-6399, October 1993. 

3. Blake, A.; et. al.: Affine-Invariant Contour Tracking With Automatic Control of 
Spatiotemporal Scale. In: Proc. of 4. Int. Conf on ComputerVision, Berlin, May 
1 lth-14th, 1993, IEEE Computer Society Press, ISBN 0-8186-3870-2. 

4. Maybeck, P. S.: Stochastic Models, Estimation and Control. Academic Press Inc., 
Vol. 1-3, ISBN 0-12-48070(lor2or3)-X, 1982. 

5. Sorenson, FI. W.: Kalman Filtering: Theory and Application. IEEE Press, ISBN 0- 
87942-191-6, 1985. 

6. Kalman, R. E.: A New Approach to Linear Filtering and Prediction Problems, 
1960, In: [5] from this list, pp. 16-26. 

7. Haykin, S.: Adaptive Filter Theorie. 3 '^^ ed-, Prentice-Hall Inc., 1996. 

8. Schnelting, O.; Michaelis, B.; Mecke, R.: Motion Estimation By Using Artificial 
Neural Networks, Proc. of lEEE-Workshop on Nonlinear Signal and Image 
Processing, Neos Marmaras, Greece, June 20th-22th, 1995, Vol. 2, pp. 714-717. 

9. Mecke, R.: Grauwertbasierte Bewegungsschatzung in monokularen Bildsequenzen 
unter besonderer Beriicksichtigung bildspezifischer Stdrungen, PhD Thesis, Otto- 
von-Guericke-University of Magdeburg, Department Electrical Engineering, 1999. 

10. Rao, R. P. N.: Robust Kalman Filters for Prediction, Recognition, and Learning. 
Technical Report 645, University of Rochester, Computer Science Department, 
Rochester, New York, Dezember 1996. 




Spectral Correspondence for Deformed 
Point-Set Matching 



M. Carcassoni and E. R. Hancock 

Department of Computer Science, Univerisity of York 
York YOl 5DD, UK. 



Abstract. This paper describes a modal method for point-set tracking 
in motion sequences. The framework for our study is the recently re- 
ported dual-step EM algorithm of Cross and Hancock [.3]. This provides 
a statistical framework in which the structural arrangement of the point- 
sets provides constraints on the pattern of correspondences used to esti- 
mate alignment parameters. In this paper our representation of point-set 
structure is based on the point-adjacency matrix. Using ideas from spec- 
tral graph-theory, we show how the eigen-vectors of the point-adjacency 
matrix can be used to compute point correspondence probabilities. We 
show that the resulting correspondence matching algorithm can be used 
to track deforming point-sets detected in motion sequences. 



1 Introduction 

Correspondence matching is key to the analysis of articulated and deformable 
objects in motion sequences. The reason for this is that correspondences be- 
tween image features, such as points or lines, are required before the degree of 
articulation or deformation can be assessed via detailed alignment. If the ar- 
rangement of object features is represented using a relational structure such 
as an adjacency graph, then consistent correspondence patterns can be recov- 
ered using techniques such as inexact graph-matching [13]. Moreover, using the 
structural arrangement of the features renders the estimation of correspondences 
quasi invariant to a variety of deformations. However, one of the shortcomings 
of existing inexact graph-matching methods is their computational overheads, 
which can render them unsuitable for motion sequence analysis. 

An efficient alternative is to use a modal representation of the relational 
arrangement of features. Over the past decade, there has been considerable effort 
expended in developing modal models of point-set deformation, Although the 
literature on the topic is dense, there are two familiar examples which merit 
special mention. In the point distribution model of Cootes and Taylor [2] the 
deformation of the point-set is modelled using the eigen-modes of the point- 
position covariance matrix for a set of training examples. By contrast, the finite 
element model of Sclaroff and Pentland [6] uses the modes of vibration the points 
under elastic forces. Although this work has made considerable strides, it can be 
criticised on the grounds that it is based on a central clustering model. In other 
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words, point-set deformation is measured relative to the centre-of-gravity. As a 
result a considerable body of information concerning the relational arrangement 
of the points is overlooked. 

It is the modal analysis of relational structure which is central to the field of 
mathematics known as spectral graph theory. This is a term applied to a family 
of techniques that aim to characterise the global structural properties of rela- 
tional graphs using the eigenvalues and eigenvectors of the adjacency matrix [1]. 
Although the subject has found widespread use in a number of areas including 
structural chemistry and routeing theory, there have been relatively few appli- 
cations in the computer vision literature. The reason for this is that although 
elegant, spectral graph representations are notoriously susceptible to the effect 
of structural error. In other words, spectral graph theory can furnish very ef- 
ficient methods for characterising exact relational structures, but soon breaks 
down when there are spurious nodes and edges in the graphs under study. 

Spectral methods for graph analysis invariably commence by computing the 
Laplacian matrix. This is closely related to the node adjacency matrix. The 
diagonal elements of the Laplacian matrix are equal to the degree of the nodes 
(vertices) and the off diagonal elements are unity if the corresponding nodes 
are connected by an edge, and are zero otherwise. However, it is also common 
to work with proximity or property matrices where the off diagonal elements 
reflect the difference in node attributes such as position [9] or orientation [8]. 
Once a matrix characterisation of the graph is to hand then the eigenvalues and 
eigenvectors are computed. The main idea behind spectral graph theory is to use 
the distribution of eigenvalues to provide a compact summary of graph-structure. 

In the computer vision literature there have been a number of attempts to 
use spectral properties for graph-matching and object recognition. Umeyama has 
an eigen-decomposition method that matches graphs of the same size [12]. Bor- 
rowing ideas from structural chemistry, Scott and Longuet-Higgins were among 
the first to use spectral methods for correspondence analysis [7]. They showed 
how to recover correspondences via singular value decomposition on the point 
association matrix between different images. In keeping more closely with the 
spirit of spectral graph theory, yet seemingly unaware of the related literature, 
Shapiro and Brady [9] developed an extension of the Scott and Longuet-Higgins 
method, in which point sets are matched by comparing the eigenvectors of the 
point proximity matrix. Here the proximity matrix is constructed by comput- 
ing the Gaussian weighted distance between points. The eigen-vectors of the 
proximity matrices can be viewed as the basis vectors of an orthogonal trans- 
formation on the original point identities. In other words, the components of 
the eigenvectors represent mixing angles for the transformed points. Matching 
between different point-sets is effected by comparing the pattern of eigenvectors 
in different images. Shapiro and Brady’s method can be viewed as operating in 
the attribute domain rather than the structural domain. Horaud and Sossa have 
adopted a purely structural approach to the recognition of line-drawings. Their 
representation is based on the immanantal polynomials for the Laplacian matrix 
of the line-connectivity graph. By comparing the coefficients of the polynomials. 
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they are able to index into a large data-base of line-drawings. In another appli- 
cation involving indexing into large data-bases, Sengupta and Boyer have used 
property matrix spectra to characterise line-patterns. Various attribute represen- 
tations are suggested and compared. Shokoufandeh, Dickinson and Siddiqi [10] 
have shown how graphs can be encoded using local topological spectra for shape 
recognition from large data-bases. 

The focus of this paper is to investigate the use of property matrix spectra for 
matching point-sets undergoing deformation in motion sequences. As mentioned 
above, spectral methods offer an attractive route to correspondence matching 
since they provide a representation that can be used to characterise graph struc- 
ture at the global level. If used effectively, the spectral representation can be 
used for rapid matching by comparing patterns of eigenvalues or eigenvectors. 
Unfortunately, these advantages must be traded against their relative fragility to 
the addition of noise and clutter. For instance, although the methods of Horaud 
and Sossa [11] and Shapiro and Brady [9] work well for graphs that are free of 
structural contamination, they do not work well when the graphs are of different 
size. 

To render the modal analysis more robust, we adopt a statistical framework. 
The framework for our study is provided by the recently reported dual-step EM 
algorithm of Cross and Hancock [3] . This interleaves the processes of correspon- 
dence matching and point-set alignment in an iterative matching process. The 
idea underpinning the method is to use the consistency of the pattern of corre- 
spondences to constrain the estimation of alignment parameters. The method is 
proved effective in the matching of planar point-sets under affine and perspective 
geometries. Despite proving effective the method is computationally demand- 
ing. The reason for this is that the correspondence probabilities, which weight 
contributions to the expected log-likelihood function for alignment parameter 
estimation are computed using a time-consuming dictionary-based method. The 
aim in this paper is to address this deficiency by using a more efficient method 
for computing the correspondence probabilities using modal structure. 

2 Dual-Step EM Algorithm 

We are interested in the correspondence matching of deformed point-sets in 
motion sequences. The framework for our study is provided by the dual-step 
EM algorithm of Cross and Hancock [3]. This statistical algorithm has been 
demonstrated to improve matching performance by interleaving the processes 
of alignment and correspondence. In this section we review this method before 
proceeding to show how modal structure can be used to constrain correspondence 
matching. 

2.1 Prerequisites 

The aim in this paper is to use the dual-step EM algorithm of Cross and Han- 
cock [3] to render the process of modal correspondence matching robust. Before 
we detail the algorithm, we provide some of the formal ingredients of the method. 
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AfRne Geometry: Suppose that is the geometric transformation that best 
aligns a set of image feature points w with their counterparts in a model z at it- 
eration n of the algorithm. Each point in the image data set is represented by an 
augmented position vector Wi = (a;,, j/i, 1)^ where i is the point index. This aug- 
mented vector represents the two-dimensional point position in a homogeneous 
coordinate system. We will assume that all these points lie on a single plane in 
the image. In the interests of brevity we will denote the entire set of image points 
by w = {tUi, Vt C T>} where T> is the point set. The corresponding fiducial points 
constituting the model are similarly represented by z = G A4} where A4 

denotes the index-set for the model feature-points Zj . 

The Cross and Hancock alignment method [3] is quite general and can be 
used with a variety of image deformation models. However, in this paper we 
confine our attention to affine deformations. The affine transformation has six 
free parameters. These model the two components of translation of the origin on 
the image plane, the overall rotation of the co-ordinate system, the overall scale 
together with the two parameters of shear. These parameters can be combined 
succinctly into an augmented matrix that takes the form 






* 1,1 








^ 2,1 


^ 2,2 


43 


( 1 ) 



With this representation, the affine transformation of co-ordinates is computed 
using the following matrix multiplication 



Correspondences: The recovery of the parameters of the transformation ma- 
trix <l>, requires correspondences between the point-sets. In other words, we 
need to know which point in the data aligns with which point in the model. 
This set of correspondences between the two point sets is denoted by the func- 
tion : M ^ V from the nodes of the data-graph to those of the model 
graph. According to this notation the statement = j indicates that there 

is a match between the node i & T> oi the model-graph to the node j € Ai of 
the data-graph at iteration n of the algorithm. 



2.2 The Dual-Step EM Algorithm 

Cross and Hancock’s contribution was to present an extension of the standard 
EM algorithm in which the structural consistency of correspondence matches can 
be used to gate contributions to the expected log-likelihood function [3]. This 
idea is closely related to the hierarchical mixture of experts algorithm of Jordan 
and Jacobs [5]. However, the method uses a dictionary method for computing 
the correspondence probabilities which is both localised and time consuming. 
The aim here is to replace the dictionary-based method used to compute the 
probabilities with a modal correspondence method. 



124 M. Carcassoni and E. R. Hancock 

Expected Log-likelihood: According to Cross and Hancock we seek both 
correspondence matches (i.e. the function /) and transformation parameters 
which maximise the expected log-likelihood 

Q(<?("+')|<^(”)) = E E (2) 

iev jeM 

The meaning of this expected log-likelihood function requires further comment. 
The measurement densities p{wi\zj,<P^'^~^^'>) model the distribution of alignment 
errors between the data-point position Wi and the the model point position Zj 
at iteration n of the algorithm. The log- likelihood contributions at iteration n-l- 1 
are weighted by the a posteriori measurement probabilities P{zj\wi,<P^'^')) com- 
puted at the previous iteration n of the algorithm. The individual contributions 
to the expected log-likelihood function are gated by the structural matching 
probabilities • These probabilities measure the consistency of the pattern of 
correspondences when the match (i) = j is made. Their modelling is the 
topic of Section 3. 



Expectation: In the expectation step of the EM algorithm the a posteriori 
probabilities of the missing data (i.e. the model-graph measurement vectors, Zj) 
are updated by substituting the point position vector into the conditional mea- 
surement distribution. Using the Bayes rule, we can re-write the a posteriori 
measurement probabilities in terms of the components of the corresponding con- 
ditional measurement densities 






af"^p{w^\zj,<P^^^) 
Ej'eM a^p^p{wi\zp,<p(^)) 



( 3 ) 



The mixing proportions are computed by averaging the a posteriori probabilities 
over the set of data-points, i.e. l^i> order to 

proceed with the development of a point registration process we require a model 
for the conditional measurement densities, i.e. p{wi\zj,<P^'^'>). Here we assume 
that the required model can be specified in terms of a multivariate Gaussian dis- 
tribution. The random variables appearing in these distributions are the align- 
ment errors for the position predictions of the ith data point delivered by the 
current estimated data-point positions. Accordingly we write 
















( 4 ) 



In the above expression S is the variance-covariance matrix for the position 
errors. 



Maximisation: The dual step EM algorithm iterates between the two inter- 
leaved maximisation steps for alignment parameter estimation and estimating 
correspondence assignments. 
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— Maximum a posteriori probability correspondence matches: These are 
sought so as to maximise the a posteriori probability of structural match. 
The update formula is 

/("+!)(*) = argmaxP(2,>i,<l>("))cf”^ (5) 

j&M 

— Maximum likelihood alignment: In the case of afhne geometry, the trans- 

formation is linear in the parameters. This allows us to locate the maximum- 
likelihood parameters directly by solving the following system of saddle-point 
equations for the independent affine parameters running over the in- 

dices k = 1,2 and I = 1, 2, 3 






= 0 



(6) 



For the affine transformation the set of saddle-point equations is linear, and 
are hence easily solved by using matrix inversion. The updated solution 
matrix is given by 



(p(n+l) _ 



X 






(7) 



where the elements of the matrix U are the partial derivatives of the affine 
transformation matrix with respect to the individual parameters, i.e. 



U = 





(8) 



This allows us to recover a set of improved transformation parameters at 
iteration n -I- 1. Once these are computed, the a posteriori measurement 
probabilities may be updated by applying the Bayes formula to the mea- 
surement density function in the expectation step. 



3 Spectral Methods for Correspondence Matching 

The aim in this paper is to show how the modal structure of point-sets can be 
used to compute the correspondence probabilities required by the dual-step EM 
algorithm. The modal approach to correspondence commences by enumerating a 
point proximity matrix. This is a continuous counterpart of the graph adjacency 
matrix. Rather than setting the elements to unity or zero depending on whether 
or not there is a connecting edge between a pair of nodes, the elements of the 
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proximity matrix are weights that reflect the strength of a pairwise adjacency 
relation. Once the proximity matrix is to hand, then correspondences are located 
by computing its eigenvectors. The eigenvectors of the proximity matrix become 
the columns of a transformation matrix which operates on the original point 
identities. The rows of the transformation matrix represent the components of 
the original points in the directions of the eigenvectors. We can locate point 
correspondences by searching for rows of the transformation matrix which have 
maximal similarity. 

Unfortunately there are two drawbacks with the modal method of corre- 
spondence. Firstly, there is no clear reason to use Gaussian weighting in favour 
of possible alternatives. Moreover, the Gaussian weighting may not be the most 
suitable choice to control the effects of pattern distortion due to point movement 
under measurement error or deformation under affine or perspective geometry. 
Secondly, the method proves fragile to structural differences introduced by the 
addition of clutter or point drop-out. 

The aim in this section is to address these two problems. We commence 
by considering alternatives to Gaussian weighting. Next we suggest how the 
comparison of the eigenvectors can be effected in a manner which is robust to 
measurement error and poor ordering of the eigenvalues. The contribution is 

(n) 

therefore to show how the correspondence probabilities Q ^ can be computed in 
an efficient and robust manner. 



3.1 Point Proximity Matrix 



In this section we show how to construct the weighted point-proximity matrix. 
The role of the weighting function is to model the probability of adjacency 
relations between points. 

The standard way to represent the adjacency relations between points is to 
use the Gaussian proximity matrix. If i and are two data points, then the 
corresponding element of the proximity matrix at iteration n of the algorithm is 
given by 

Hy = 



2s2 



(n) (n) I 

w) — w], 



(9) 



where s is a width parameter. 

However, we have investigated alternative weighting functions suggested by 
the robust statistics literature. One of the most successful has proved to be 
a continuous variant of Huber’s kernel, generated by the hyperbolic tangent 
function 



Hy = 



(n) (n) I 

w] — w), 



• tanh 



^ 1 1 (^) ('^) I 

-\\wl -wl, 'I 



(10) 



The modal structure of the point adjacency matrix is found by solving the 
eigenvalue equation HEi = A; if/, where A/ is the eigenvalue of the ma- 
trix H and Ei is the corresponding eigenvector. We order the vectors accord- 
ing to the size of the associated eigenvalues. The ordered column-vectors are 
used to construct a modal matrix V = {E\,E2,E^, ). The column index 
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of this matrix refers to the order of the eigenvalues while the row-index is 
the index of the original point-set. This modal decomposition is repeated for 
both the data and transformed model point-sets to give a data-point modal 
matrix = (iff, if|p|) and a model-point modal matrix Vm = 

Since the two point-sets are potentially of different 
size, we truncate the modes of the larger point-set. This corresponds to remov- 
ing the last 1 1 1? I — |Ai|| rows and columns of the larger matrix. The resulting 
matrix has o = min[I?, Ai] rows and columns. 

The modal matrices can be viewed as inducing a linear transformation on the 
original identities of the point-sets. Each row of the modal matrix represents one 
of the original points. The column entries in each row measure how the original 
point identities are distributed among the different eigen-modes. 



3.2 Correspondences 

With the modal structure of the point-sets to hand, then we can locate corre- 
spondences by comparing the rows of the modal matrices. We have investigated 
three different methods to compute the correspondence probabilities required by 
the dual-step EM algorithm. 



Shapiro and Brady: Based on this eigen-decomposition Shapiro and Brady [9] 
find correspondences by comparing the rows of the model matrices Vm and Vd ■ 
The decision concerning the correspondences is made on the basis of the simi- 
larity of different rows in the modal matrices for the data and the model. The 
measure of similarity is the Euclidean distance between the elements in the corre- 
sponding rows. According to Shapiro and Brady the correspondence probabilities 
are assigned according to the following binary decision 

= ifj = argminyX;Lill^i>”^(bf (u) 
1 0 otherwise 



Correspondence Probabilities: Rather than using the binary correspondence 
pattern of Shapiro and Brady to gate contributions to the expected log-likelihood 
function, we can use the elements of the two modal matrices to compute the 
probabilities of correspondence match. This is done by comparing the elements 
of the two matrices on a row-by-row basis. A simple way of computing the 
probabilities is to assume that the vectors are subject to Gaussian measurement 
errors. As a result, the probability that data-point i is in correspondence with 
model data-point j is equal to 



exp 






J2feM exp 




IW 



( 12 ) 



where ^ is a parameter which is set to control the width of the probability 
distribution. 
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Robust Correspondence Probabilities: The shortcoming of this method 
for computing the correspondence probabilities is the effect of outlier measure- 
ment errors. When there is a significant difference between one or more of the 
components of the eigenvectors, then these errors dominate the argument of the 
exponentials. This will have the tendency to flatten the distribution and will 
result in ambiguity and equivocation concerning the pattern of correspondences. 
One way to make the computation of correspondences robust to outlier mea- 
surement error is to accumulate probability on a component by component basis 
over the eigenvectors. To do this we define the correspondence probability to be 



E/=i exp 






Ei'eAT ELi exp 





(13) 



In this way large measurement errors contribute insignificantly through the in- 
dividual exponentials appearing under the summation over the components of 
the eigenvectors. 

4 Sensitivity Study 

In this section we investigate the noise sensitivity of the new modal correspon- 
dence method reported in this paper. Our experiments are conducted with ran- 
dom point sets. We investigate two sources of error. The first of these is random 
measurement error or point-position jitter. Here we subject the positions of the 
points to Gaussian measurement error. The parameter of the noise process is 
the standard deviation of the point position error. The second source of error is 
structural. Here we remove random points. This is the most destructive type of 
error for modal methods. 

We compare three different algorithms. The first of these is the standard 
SVD method of Shapiro and Brady. Here there is no EM iteration. The corre- 
spondences are selected using the probabilities defined in Equation (12) so that 
/(z) = arg maxj The second is the alignment of the point-sets using the 
standard EM algorithm. This algorithm does not use any information concern- 
ing the pattern of correspondences. Contributions to the log-likelihood function 
are weighted only by the a posteriori alignment probabilities P{zj\wi,<P^'^'>). 
In other words, we set Q j = 1 for all values of i and j. Finally, we use the 
correspondence probabilities defined in Equations (13) and (14) to weight con- 
tributions to the expected log-likelihood function is the dual-step EM algorithm. 
We respectively refer to these three algorithms as SVD, EM and EM-I-SVD. 

We commence in Figure la by showing the effect of positional jitter on the 
fraction of correct correspondences. Here we deal with two point-sets of the same 
size. The positions of the data-points are displaced relative to their counterparts 
in the model by adding random Gaussian measurement errors. The standard 
deviation of the Gaussian noise is kept fixed. However, we gradually add more 




Spectral Correspondence for Deformed Point-Set Matching 129 



Direct comparison between methods 



Direct comparison between methods 




Fig. 1. Effect of a) graph size, b) skew angle, c) measurement error and d) clutter 



points to the two point-sets. This has the effect of increasing the point-density, 
i.e. of decreasing the average inter-point distance. The main effects to note are 
the following. Firstly, all three methods degrade with the increasing size of the 
point-sets (i.e. decreasing inter-point distance). The second feature to note is 
that the new method (i.e. SVD-I-EM) consistently gives the best results. More- 
over, the performance of the pure SVD method degrades rapidly as the relative 
displacement of the points is increased. Finally, the SVD-I-EM method outper- 
forms the EM method by a useful margin. 

Next we investigate the effect of controlled affine skew of the point-sets. 
Figure lb shows the fraction of correct correspondences as a function of the 
skew angle in degrees. Here we compare the standard EM method, the EM 
method with a hyperbolic tangent weighting function (EM-I-SVD-I-EU), and the 
use of pure SVD with both a Gaussian weighting function (SVD-I-GAU) and a 
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hyperbolic tangent weighting function (SVD+EU). Here the use of the dual step 
EM algorithm with the new proximity weighting function (SVD+EM+EU) offers 
clear advantages. There is a 10% margin of improvement for all skew angles. 

We now turn to the effect of positional jitter. Figure Ic compares the effect 
of using the three different ways of computing the correspondence probabilities 
in the dual-step EM algorithm. SVD-I-EM-I-Pl refers to the model presented in 
Equation 12, SVD-I-EM-I-P2 to that presented in Equation 13 and SVD-I-EM-I-P3 
to that presented in Equation 14. Here we add Gaussian measurement errors to 
the positions of the data points. We report the results as a function of the ratio 
of the noise standard deviation to the average inter-point distance. 

Finally, we study investigate the effects of point-set contamination. Here 
we investigate the matching of point sets of different sizes. We add increasing 
fractions of clutter nodes to the data point-set. We commence by noting that the 
the pure SVD method fails as soon as clutter is added. Hence we can not plot 
any sensitivity data from the method. Figure Id shows the fraction of correct 
correspondences as a function of the fractional difference in the point-set sizes, 
i.e. the fraction of clutter added to the data point-set. The different curves 
are for different proximity weighting functions and for different correspondence 
probability models. The main conclusion to note is that the dual step EM method 
(i.e. the curves labelled SVD-I-EM) outperforms the standard EM method by 
about 5-10%. Also, as the fraction of clutter increases then so the performance 
degrades. 

5 Real World Data 

Our final piece of experimental work focusses on real-world data. Here we have 
matched images from a gesture sequence in which a hand is clenched to form a 
fist. The feature points in these images are points of locally maximum curvature 
on the outline of the hand. Figures 2a and 2b respectively show the sequence 
of correspondence matches obtained with the dual-step EM algorithm and the 
SVD method of Shapiro and Brady. In each panel the image on the left-hand 
side is the initial image in the sequence. The images in the right-hand panel 
are those obtained after 1, 15 and 25 frames have elapsed. Here the sequence 
is captured at a rate of 10 frames per second. The matches shown are directly 
from the left-hand frame to the right-hand frame, i.e. no intermediate frames 
are used. In the case of the dual-step EM algorithm, the matches are all correct 
in each frame. In the case of the SVD method the method begins to break after 
17 frames. 



6 Conclusions 

In this paper we have described a method for point-set matching which uses 
the modal structure of the point-proximity matrix to compute correspondence 
probabilities. These probabilities are used in conjunction with the dual-step EM 
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Fig. 2. Matches- dual-step EM-I-SVD (left) and pure SVD (right) 



algorithm of Cross and Hancock [3] to match point-sets undergoing affine dis- 
tortion. The method is shown to be effective for gesture tracking in motion 
sequences. 

Our future plans revolve around developing more sophisticated point defor- 
mation models. Here there are several possibilities including the use of spline 
warps and the use of mixture models to capture articulated motion. 
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Abstract. Motion Capture has been adopted for the production of 
highly realistic movements, as well as for the clinical analysis of patho- 
logical motions. In both cases, a skeleton model has to be identified to 
derive the joint motion. The optical technology has gained a large 
popularity due to the high precision of its marker position measure- 
ments. However, when it comes to building the skeleton frames out of 
the 3D marker positions, significant local skin deformations may penal- 
ize the quality of the model reconstruction. In this paper we exploit a 
local fitting tool to visualize the influence of skin deformation on 
marker movements. Such a knowledge can in turn improve the layout of 
optical markers. We illustrate our viewpoint on motions of the upper- 
torso. 



1 Introduction 

Various motion capture technologies are used for measuring the movement of human 
beings either for animating virtual humans or analysing the movement per se (e.g. for 
sport performance or clinical context). Until now, the most successful technology is 
optical motion capture. This is due to its high precision measurement of little reflec- 
tive markers, attached on some relevant body landmarks. In a production context, the 
movement of an artist is captured with two to eight calibrated cameras. For simple 
motions, the multiple views of markers allow the automatic reconstruction of their 3D 
position. Depending on the system, a static posture [1] or a special calibration motion 
(further referred to as the gym motion) is used to build or adjust a skeleton model. The 
skeleton model helps, in a second phase, to derive the angular trajectories of all the 
captured motions. In this second phase, the markers are generally assumed to be fixed 
in the coordinate system of a body segment. This assumption is weak for a body re- 
gion undergoing large deformations, such as the shoulder. In this paper we exploit a 
recent tool for the analysis of local marker displacements (i.e. with respect to the un- 
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derlying bones). This tool is designed to provide needed information for the skeleton 
fitting task, by highlighting marker sites that undergo important relative motion with 
respect to the underlying bones. It also helps to eliminate redundant markers and 
identify potentially interesting new marker locations. 

The paper focuses first on the problem of skeleton identification for motion capture. 
Then it recalls our local fitting technique for deriving joint center from relative marker 
trajectories. The next section illustrates how it can be used to optimize the marker 
positioning of the upper-torso region. The conclusion summarizes the tradeoffs re- 
garding marker positioning and suggests new research directions. 



2 Related Work 

Identifying the correct location of human joint center from external information is a 
difficult problem. The most simple approach is to scale a standard human skeleton to 
the total height of a given person; needless to say, it requires some adjustments but it 
is sufficient for entertainment applications [2]. Within the same frame of mind, exter- 
nal anatomic features can be detected and exploited from a static 3D envelop captured 
by digital cameras [3]. However, the precision of these two approaches is very low. 
Other promising techniques emerge from the field of video-based motion analysis [4]. 
In [5] an arm recorded with a stereo system is being tracked by fitting a model built 
out of ellipsoids to the data. This way, the skeleton fitting is concomitant to the motion 
tracking. In the longer term, one should be able to derive a generic model of the skin 
deformation from such data, thus paving the way to much more precise identification 
of the underlying skeleton movements. 

Presently optical and magnetic systems prevail in motion capture as they offer the best 
compromise in terms of precision and overall cost (processing and human interven- 
tion). It is a standard working hypothesis in the literature to assume that the markers 
are rigidly linked to the underlying skeleton [6] (it is also reported for magnetic mo- 
tion capture [7], [8]). However, the rigid body hypothesis causes important errors in 
the estimation of the joint kinematics. This was reported in [9] for marker-based sys- 
tems or in [2] for magnetic systems. It is difficult to identify a better model for the 
local movement of the markers as it results from the combination of the interrelated 
movements of the bones, muscles, fatty tissues and the skin. Proposed solutions in 
optical motion capture are: carefully designing marker clusters [10], considering each 
marker separately [11], or allowing partial freedom of motion between the markers 
and the associated bones [12]. This latter work proposes a methodology based on an 
anatomic human model. The human model encompasses a precise anatomic descrip- 
tion of the skeleton mobility associated with an approximated envelope. It has a dou- 
ble objective: by ensuring a high precision mechanical model for the performer, the 
tracking algorithm can predict the 3D location and the visibility of markers. This re- 
duces significantly the human intervention in case of marker occlusion. The work 
described in the present article exploits the visualization features of a local fitting tool 
for which we recall the major characteristics in the next section (we refer the reader 
to [13] for full details). 
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Fig. 1 : Orientation error between a strapped magnetic sensor and the underlying arm during 
axial rotation. A dedicated approach solving for this problem is proposed in [2] for magnetic 
motion capture 



3 Building Local Frames 

When looking for the position of the bones of a person, a first observation is that the 
relative distance of markers attached to one limb is almost constant. The biggest de- 
viations occur when markers are attached on parts that suffer maximal deformation 
during the movement, as around the joints or on massive muscles (e.g. on the thigh). 
Our approach handles this context by decomposing the problem into two tasks: the 
partitioning of markers into rigid cliques and the estimation of joint centers. A clique 
denotes a set of markers where each member remains within a distance tolerance wrt 
all the other markers of the set. Mastering the partitioning and the joint center estima- 
tion allows us to visualize local marker trajectories and thus better understand the skin 
deformations 



3,1 Partitioning Marker into Rigid Segment Set 

In the following, we assume that we exploit a motion called the "gym motion", which 
highlights most of the body mobility with simple movements. The corresponding file 
of 3D marker locations is the input of the partitioning algorithm. 

The partitioning algorithm computes the distances between markers at each frame of 
the gym motion (Fig. 2). It selects the biggest cliques for a given distance threshold. 
This condition defines a rigid segment set. The system may look for the expected 
number of partitions or the user can interactively tune this threshold (Fig. 3). In addi- 
tion, we define the attachment weight of a marker to a segment as a normalized meas- 
ure of the rigidity of its attachment to that segment. By default, all the attachment 
weights have a value of 1.0. 
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Fig. 2. Partitioning algorithm 




Fig. 3. Partitions after user corrections 




Fig. 4. The trajectory of a marker M 
around an adjacent segment OA 



3,2 Visualizing Relative Trajectories of Markers 

If we consider a referential bound to a bone represented by a segment e.g. OA 
(Fig. 4), the markers that are attached on adjacent segments (e.g. OB), theoretically 
move on a sphere centered on the joint that links the two segments (here joint O). This 
comes from the hypothesis of constant distance between markers and joints. 

The position of a 3D object in space is completely defined by three non-colinear 
points. Thus, if we have a minimum of three markers on a segment, we can define the 
position and orientation of that object in space. Afterwards, we compute the move- 
ment of the markers on adjacent segments in the referential established by these mark- 
ers and we estimate their centers of rotation (as in Fig. 5 and Fig. 6). The centers of 
rotation correspond to the joints. From their position in space, we can approximate the 
lengths of the body segments as the distances between them. For example, in Fig. 4 
we can compute the position of the joints A and O in space and we get the dis- 
tance ||AO||. 

Due to the deformations undergone by the skin during the motion, the markers at- 
tached on a limb change their position with respect to the bone. As long as the defor- 
mation is only due to a twisting along the bone segment, it is filtered out by its prop- 
erty of maintaining the distance to the joints. However, a deformation that is changing 
the distance to the bone (e.g. due to muscles such as biceps) or one that changes the 
position along the bone induces unknown errors for the joint center computation. 
Markers suffering such deformation errors are further said to belong to the noisy class. 
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We deal with these errors by introducing a LSQ computation of the center of rotation. 
We use a modified version of the Levenberg-Marquardt method [14] for all of our 
least squares computations. Depending on the complexity of the movements, the er- 
rors sum up or compensate each other (more details in [13]). 




3,3 Estimating the Position of Joints 

In the p segment referential we compute all the centers of rotation for all the markers 
of an adjacent segment a_segment (Fig. 4 ). The center of rotation is estimated as the 
result of the function: 

argmin ^{d{r,x,y,z)xweight{r,x,y,z)f (1) 

^ ’ •’^0 ’ ’ -^0 trajectory 

corresponding to the LSQ minimization [10] of the function: 

d{r,x,y,z)xweight(r,x,y,z) (2) 

where: 

d{r,x,y,z) = ^{x-Xof +(y - y^y +(z - z^)^ -r 

and the function weight{r,x,y,z) computes the inverse of the density of the tra- 
jectory samples in a region of the space. We compute this density by first dividing the 
space in a set of parallelepipeds in which we count the number of points. First we 
compute automatically the minimal box containing all the points of the trajectory and 
we divide it, dividing each direction by a factor of 5 or 10. This increases the impor- 
tance of poorly populated regions of the space, where the performer stays for very 
short time. 
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We then estimate the joint position as the center of mass of the centers of rotation, 
weighted by the associated marker weight (Fig. 7). 

weight = weight(mkr, a _ segment) factor(mkr, a _ segment) ( 4 ) 

In our experiments, we found a good value for factor given by: 

factor {mkr , a _ segment) = 

Let us take Fig. 7 as an example. After defining the system of coordinates bound to 
5*^ , we estimate the center of rotation J of 5*2 in this referential. In order to do this, 
we estimate the center of rotation x ^ of each of the markers M, N and P. Then we 

compute the mass center of the centers of rotation for M, N and P using the weights 
computed with the previous formula: 

Y.(^cen,er^Weighk^„,J 
centers 

centers 

There is a case where the trajectory of a marker describes a circle and not a sphere, 
due to reduced degree of freedom for a certain joint (namely the elbow). We project 
this trajectory in the plan that best contains it. This plan can be found by using a LSQ 
that minimizes the distance between it and the points on the trajectory (Fig. 6). 

A certain attention has to be paid to the case where we have less than three attached 
markers on a segment. This case occurs often in our experiments. Currently, we are 
satisfied with two markers if the adjacent joints can be acceptably modeled as having 
only one rotational degree of freedom. In this case we determine the system of coordi- 
nates by the plane that contains the two markers of the base segment and the marker 
whose trajectory is being tracked. The center of rotation is computed in this plane and 
then mapped back into the global referential. We compute there the center of mass of 
all the centers of rotation, computed for all the markers on a neighbor segment in 
order to find an estimate for the position of the joint. Afterwards, we perform as ex- 
plained before. For example (Fig. 6), we compute all the rotation centers of the mark- 
ers on OA around OB, and all the rotation centers of the markers on OB around OA. 
Then we compute the center of mass using the weights of the considered markers and 
the inverse of the radius of the circles or spheres described by them during their mo- 
tion. 



= 



1 / ( 5 ) 

/ radius{mkr, p _segment) 
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4 Optimizing Marker Position 

We propose to exploit the visualization of markers' local trajectories to get more in- 
sight into the bone/skin relationship. The tool should allow us to: 

> make decisions relative to the inclusion of a bone in the skeleton model, 

> distinguish between bone movement, muscle mass deformation and skin sliding, 

> discover artifacts due to underlying bone movements, 

> appreciate the correlation between bone configuration and marker position. 

We have chosen to illustrate the marker position optimization on a difficult case- 
study to better stress the interest of the visualization tool. We have retained the up- 
down (shrugging) and the forward-backward motions of the clavicles. The first part of 
the study focuses on the relation between partitioning and marker trajectory analysis. 
The visualization tool allows the assessment of the marker locations with the objective 
of retaining pertinent ones while eliminating others. The second part of the study 
highlights the skin sliding in the back region. 



4.1 Test-Case Marker Set and Motion 

Fig. 8 shows the proposed marker set on the spine and thorax (the image is inverted to 
remove the black background). All the useful markers have a label according to the 
following convention. The names of markers on the back start with B. Those on the 
spine have a format Bj with i equal 1, 2 (lumbar), 3,4 (thoracic) and 5 (neck base). 
Two other markers on the thorax are labeled with BL (as Low) and BH (as High). The 
names of markers on the front start with F respectively with FI and F3 on the clavi- 
cles joints and F2, F4, F5 on the thoracic cage. The distance F4-F5 is approximately 
30 cm. Finally the names of markers on the shoulders start with S respectively with 
SR for Right side and SL for Left side. Compared to standard motion capture practice 
[1], the present set of makers is deliberately large to explore the local skin deforma- 
tions. 

The motion is performed so as to highlight a single degree of mobility at a time: here 
the clavicle up-down (schrugging) and forward-backward motions. The motion is 
repeated a few times for each mobility either independently or simultaneously on both 
sides. The same initial posture with dangling arms is used for all motion recordings. In 
addition to being captured, the whole motion capture session has been recorded with a 
digital video camera. A selection of snapshots has been made from the resulting DVD 
tape and is presented below. 
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Fig. 8 Front and back views of the spine and thorax regions 



4,2 Retaining Pertinent Markers 

The partitioning is the action of grouping markers belonging to the same body seg- 
ment. This stage is important since it determines the local frames in which marker 
trajectories are further built and analyzed. Fig. 9 shows the cliques of the upper body. 
The thorax region is divided into three cliques (marker groups): 

❖ the right clavicle group includes SR, FI and BH 

❖ the left clavicle group includes SL, F3 and BL 

Thorax partition includes all the other (labelled) markers. It is relevant to include also 
the spine markers into this partition because the studied motion keeps the back 
straight. 

The trajectory displayed on Fig. 10 exhibits a clear rotation behavior of the SL marker 
during a few shrugging motions. The center of rotation is close to the location of the 
F3 marker put on the left clavicle joint. Although with a short lever arm, the amplitude 
of the rotation is about 60°, indicating that a clavicle segment is highly relevant in a 
skeleton model. The 10cm scale is based on the F4-F5 distance. 

When studying the motion of the SL marker, it quickly appeared that the markers on 
the clavicle joints were suffering strong local displacements due to the clavicle bones 
underlying motions. The variation of the F1-F3 distance is shown on the images of 
Fig. 8 and the successive drawings of Fig. 1 1 (indicated with arrows on (a) and (b)). 

In Fig. 11 (top views) we compare the SL marker local trajectory for an alternate 
thorax partition that excludes the two markers FI and F3 on the clavicle joints. Al- 
though slightly different from the previous partition case, the resulting SL marker 
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trajectory is qualitatively similar, thus justifying the abandon of the F1-F3 markers for 
the thorax partition. In addition, the removed markers FI and F3 proved to be a source 
of difficult 3D reconstruction as illustrated on Fig. 12 with a local rotation artifact 
during the forward clavicle motion. 




Fig. 9.: Initial marker partion- Fig. 10: Front view of the local trajectory of the left 

ing (front view) shoulder marker SL expressed in the thorax partition 

frame 




Fig. 11: Top view of the left shoulder marker trajectory with respect to the thorax partitions: 
initial partition in two postures for high clavicles (a) and low clavicles (b), and alternate parti- 
tion for (c) 




Fig. 12: Local rotation of FI and F3 due to forward clavicles motion 
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4,3 Studying Local Skin Deformations 




Fig. 13: Alternate thorax partition 
(without B4) 



The present section focuses on markers BL and 
BH that are respectively on the low and high 
part of the muscular mass associated to the 
scapula motion. The major aspect in these tra- 
jectories is their translational rather than rota- 
tional nature. These characteristics have al- 
ready appeared on Fig. 8. 

A second alternate thorax partition (without 
marker B4) had to be defined as this marker 
could not be tracked during the whole se- 
quence. The marker B4 is subject to severe 
deformation and pushing of muscular masses 
(that even led to the loss of this marker and the 
repetition of this motion). 




Fig. 14: Front views of marker BL: local trajectories during shrugging (a) and forward- 
backward motion (b). (c) is an enlargment of (b) trajectory 



Fig. 14 explores further the translation characteristics of the markers BL and BH. 
Their excursion ranges are quantified based on the F4-F5 distance. What is especially 
noticeable is the regularity of the trajectories highlighting the high correlation between 
local skin deformation and underlying bone motion. However we have observed sud- 
den displacement of the BH marker in the sagital plane (side view). This artifact pos- 
sibly comes from the underlying motion of the upper part of the scapula. Unless the 
scapula motion is of special importance, the region of BH marker should be avoided. 



5 Conclusion 

This is only one of our first sets of experiences with the analysis of the local trajecto- 
ries of the markers. The possibility to simultaneously view a movement in several 
systems of coordinates, makes the decision process clear and efficient (Fig. 15). It 
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provides pertinent feedback on the marker positioning, although highly depending on 
the quality of the gym motion. Our first conclusion regarding the studied region is 
that: 

♦♦♦ Adding more markers (three per segment), definitely helps to improve the local 
fitting (by selecting and organizing the ones that lead to good estimation), 

❖ The scapula joint seems not identifiable with the current optical motion capture 
methodology. We can however gain a better understanding of the region defor- 
mation. 

Our results on this type of (i.e. with cyclical movements of independant degrees of 
mobility) suggest that, in some body regions, there is a strong correlation between 
skeleton motion and local skin deformation. This is in phase with some observations 
from Cappozzo in [9]. One approach for the identification of this correlation could be 
to train a dedicated neural network with the gym motion for each major body region. 
The result would describe how the marker moves with the skin, as a function of the 
current posture. Compared with the current assumption of fixed marker w.r.t the un- 
derlying bones, we believe such a marker model would improve the skeleton identifi- 
cation and the quality of the captured motion. In the near future we plan to focus on 
the shoulder region with more (and smaller) markers. 
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Fig. 15: Interface for simultaneous visualization in different frames 
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Abstract - In this paper we present a specific matching technique based on 
basic motor patterns between two image sequences taken from different view 
points and a VRML synthetic human model. The criteria used are part of a 
generic system for the analysis and synthesis of human motion. The system has 
two phases: an automatic or interactively supervised analysis phase where the 
motion is interpreted and a synthesis phase where the original motion is applied 
to a biomechanical model. The main goal of our approach consists of finding a 
general solution that could be applied to general motor patterns. We define a set 
of matching conditions and we describe general-purpose criteria in order to 
reduce the space of search. The complexity of human motion analysis has led 
researchers to study new approaches and design more robust techniques for 
human tracking, detection or reconstruction. Whereas mathematical solutions 
partially solve this problem, the complexity of the algorithms proposed only 
serve to limit these solutions for real time purposes or general kind of motion 
types considered. So, we propose more simple, less general approaches but with 
a low computational cost. In this case the human model information about the 
kind of movement to be studied is very important in the process of matching 
between key-frame images. We also try to develop a system that can, at least in 
part, overcome the limitations of view dependence. 

Keywords: detection, tracking and recognition of persons, real and synthetic 
images, VRML, matching, calibration, graphic and biomechanical model, 
walking motion. 



1 Introduction 

Human motion analysis and synthesis are two areas of major interest in a variety of 
disciplines. Among others we can consider sport analysis, dancer training, 
choreography, scientific simulation, 3D animation, computer vision, medical 
rehabilitation, virtual reality and entertainment. There is a great diversity of systems 
devoted to those disciplines. 
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The most largely extended rely on the manual selection of points from the initial 
sequence of images. The 3D positions, velocities, accelerations and trajectories of the 
manually selected points are then easily computed. These systems are mainly oriented 
to capturing the motion parameters. Others systems attach markers - normally semi- 
spherical markers or infrared sensors - on the human body and they use template 
matching to automatically detect the positions of the articulations. 

In the manual systems, the precision is very dependent on the experience of the 
user that picks the 2D points. It is an extremely tedious job to pick up points from 
hundreds or thousands of frames. In the systems that use markers, the inconvenience 
caused by the small reflectors to the person being tested is obvious. Attaching 
markers makes it very difficult, if not impossible, to extend the system to applications 
such as real sport competitions. The systems mentioned above are mostly kinematic. 
Dynamics-based systems, which deal with torques and forces acting on mass, and 
hybrid systems that combine kinematics and dynamics [3, 7, 8, 9, 13] are other 
approaches. We can also find systems that are oriented to modelling motion by using 
physical laws rather than by capturing the motion parameters. 

From a vision view point new approaches have been recently proposed. The 
temporal-flow models are aimed at learning and estimating temporal flow from image 
sequences. They use a set of orthogonal temporal-flow bases that are learned using 
principal component analysis of instantaneous flow measurements. Examples of this 
kind of models are [14, 15]. Other papers consider a probabilistic decomposition of 
human dynamics at multiple abstractions, and show how to propagate hypotheses 
across space, time and abstraction levels. Low-level primitives are areas of coherent 
motion found by EM clustering, mid-level categories are simple movements 
represented by dynamic systems, and high-level complex gestures are represented by 
HMM as successive phases of simple movements. An example of this work is [16]. 
Based on 3D graphs we can recover and interpret human motion in a scene. This is 
the case of [12] where a method, based on regular part detection, is presented by 
which a skeleton is generated. Using 2D regular region, one for each view, with a 
matching process it is possible to obtain a 3D graph interpretation. 

The main goal of our approach consists of finding a general solution that could be 
applied to motor patterns using specific rules for every kind of pattern. We define a 
set of matching conditions and we describe general-purpose criteria in order to reduce 
the space of search. 

The system that we propose is based on a kinematic analysis and is non-invasive. 
The person can move freely without interfering with the system and without the 
burden imposed by the physical markers. It can be considered to be an automatic, 
interactive system since it is designed to guide the user in the matching, providing the 
necessary tools to ease the process and to trace the actual motion of the human body. 

To be able to detect the arbitrary motion of objects we restring the problem to just 
one person without loss of generality. Due to the complexity of the extraction of the 
motion from a sequence of images, we simplify the problem for our system. Initially 
we have considered only 3D articulated rigid motion with rotation. But now we 
present a system that can define matching algorithms for a certain type of motor 
patterns. 

The whole system consists of an analysis part and a synthesis part (Figure 1). The 
input of the system is a sequence of one or more grey level images taken 




Matching a Human Walking Sequence with a VRML Synthetic Model 147 



simultaneously in perspective. The output of the analysis part gives the 3D positions 
of the body parts at each moment of time from which the trajectories, speeds, 
accelerations, as well as other parameters of each part are computed. The 3D positions 
are then visualised with a reconstruction of the human body by the synthesis part. The 
final output is the synthesised motion of the human body, which should be the same 
motion represented in the input image. This can be viewed from any viewpoint with 
all the motion parameters computed for further use. The process of analysis can be 
done either automatically or manually [5, 12]. 



Human Motion Analysis System 




Figure 1. A global view of the human motion analysis system 

The main matching job is driven by specific algorithms and with the support of the 
basic matching conditions of the system to perform the match between the real images 
and the articulated model. This process is more sophisticated than those used by other 
systems working with other kinds of entities (points, lines, etc.) on the screen. It has 
the advantage of working with a higher level entity: a biomechanical model that 
directly models the human body to be matched. 

The interactive analysis part includes a processing-enhancement phase, a 
modelling phase, a matching phase, and an interpretation phase. Although the 
processing-enhancement phase in this system is not always necessary but it has to be 
used in certain cases. The modelling phase lets the user select a biomechanical model 
from a library of predefined models, or even to define a custom model for the person 
being tested. All the information measured from the person is stored in the model. 
Various levels of detail can be defined for visualisation purposes. The goal of the 
matching phase is to match the projections of the bio mechanic model with the real n- 
view image sequence (in this case two views). The main strategy is to match the 
hierarchy of the model as an entity with the image region representing the body 
position to be detected. The system has full information as to the position and 
orientation of the 3D-model in each frame of the real image animation [5]. 

The interpretation phase uses the information from the preceding phases and 
computes the motion parameters for different parts of the human body. All the 
information obtained is stored in a database for future use, such as the visualisation of 
the motion, further analysis of the performance, etc. The synthesis part includes data 
input and display, which allows the animation and visualisation of the resulting 
motion. The user has full control over the synthetic camera. Traditional techniques of 
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Inverse Kinematics and more recent results in Inverse Kinetics [1] can also be used to 
correct deviations of the resulting postures. 

Section 2 briefly introduces the basic ideas of movements founded on basic motor 
patterns with specific consideration for human walking. Section 3 describes the 
criteria for walking matching. The application of the criteria of overlapped surface 
maximisation is in section 4. Finally we use the temporal coherence criterion to 
recover from a wrong matching (section 5) and we present, in section 6 and 7, the 
results and new ideas for future work. 



2 Motion Founded on Basic Motor Patterns 

In the general cases described in [2, 5], the information related to the motion was 
kinematic, in the sense that we know the approximate displacements of certain 
segments, but these neither represent a typical movement of the individual nor cause a 
large displacement of the centre of mass, e.g., a global human body motion. When 
considering the study of a motion of this type, it becomes necessary to do it on a 
characteristic movement of the human body. Therefore, walking is considered the 
most representative basic motor pattern of this kind of movements. Other types of 
patterns (e.g., running, leaping, throwing, grabbing, hitting, kicking, reaching, etc.) 
will be studied in future works. We are also very interested in the transition between 
different movements. 

A common characteristic of these motion patterns is the combination of the 
organised movements according to a specific space-temporal setting. Our goal is to 
carry out a study of the basic patterns of walking from the analysis of image 
sequences obtained by a pair of synchronised cameras and a human biomechanical 
model. 

Human walking is a very complex activity. It is a learned activity and is integrated 
with involuntary actions although it cannot be considered as an automatic activity. 
Walking undergoes several modifications according to the parameters considered 
(shoes, ground, load, activity, etc.). For correctness, we will assume that the 
individual is a western, walking barefoot on a fiat floor. 

Our objective is centred exclusively on the kinematic analysis of the movement, 
considering that the person moves its body from one point to another in the space. 
The body is reduced to a mass point (centre of mass) that is transformed by 
translations and rotations. The body components define a kinematic pattern which is 
characteristic of a movement cycle. This cycle is divided into two parts. The stance 
(foot-floor contact) phase and the swing (foot elevation to create the new step) phase. 

Several studies have been carried out on walking, and human movement in general, 
from different points of view [2, 3, 4, 6]. Several of these studies obtained good 
results and allowed us to define a new sequence of representative charts and values of 
the types of trajectories that the different body segments go through during the 
walking cycle. Unfortunately, these studies, except recent approaches [5, 10, 12, 14], 
use techniques of data collection based on manual selection - either by digitalisation 
of points of interest on the film or by the use of measurement equipment 
(goniometers). Other systems try to model the motion from a synthesis viewpoint, and 
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we have recently seen specific research for concrete motion patterns [7, 8, 13]. Also 
from a computer vision perspective, we can use colour information [10] or knowledge 
of person’s movement in order to track unconstrained human movement [11]. Recent 
publications propose the use of contextual knowledge encoded in the higher level 
models [14], or learned temporal models [15,16]. 

Our system is aimed at carrying out an automatic adjustment of the defined model 
using real images that have been previously filmed. Under the context of manual 
positioning we can obtain motion information in the initial phase of each individual in 
order to position the first walking cycle. 

The first phase will be useful as a control phase and matching with the previous or 
posterior phases, and will give us specific information on each subject. The obtained 
values are validated so that the digitalisation process generates point trajectories that 
agree with the work done by other researchers. The adjustment of basic movement 
patterns greatly differs from the use on 2D and 3D movements. Conceptually, it is 
impossible to consider an exhaustive search of the positions in space of all the 
articulated models. On the other hand, these movements have several occlusions or 
overlappings that make the process of conventional matching (surface maximisation) 
difficult. 

The proposed alternative considers working with a finite number of N positions of 
the walking cycle that have been selected by a semiautomatic process from the control 
cycle defined for each individual. Then, a matching between the model position and 
the image projections is established, not only based on a possible overlapping of the 
surfaces but also on the information of the specific characteristics of the movement 
studied (walking, in this case). This implies that a relationship between the primitives 
obtained from the images and the biomechanical restrictions of the movement must be 
established. 

The system must be able to allow for oscillations or variations over the basic 
defined cycle, although these must be relatively small, for the time being. The result 
will be a objective matching among the N positions of the basic cycle and the 
considered images. 

On the other hand, we present the model graphs overlaid for the cases manually 
studied in our system. For this manual study we consider a step cycle (right-to-left 
and left-to-right) and a non-fast natural walking. For the known values, the mean 
cycle duration is 0.87 seconds with a variation of 0.06 seconds. For normal walking it 
is 1.06 seconds with a variation of 0.09 seconds. We study a typical value for the 
natural walking cycle of 1 second. In our case, 36 images (1.44 s.) are studied for one 
cycle, thus the walking is paused normal. 



3 Criteria for Walking Adjustment 

This section introduces the criteria needed to carry out an adjustment of the 
biomechanical model defined on the walking subject projections. The goal is to find a 
typical parameter sequence for walking, from the image projections that allow us to 
establish some criteria for discrimination between the model projections and the real 
images. A first approach to the problem makes us try the same criterion (already used 
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in conventional adjustment, but with different goals) of studying the evolution of the 
bounding boxes that contain the subject on both views. From the boxes, we could 
study features of the width, height and area funetions on each of them, on the frontal 
and lateral views. Once these representations are obtained, it is necessary to find the 
possible relationships or parameters that are important in order to establish a process 
or pre-selection. The values and charts of these functions are shown in the results 
seetion. The frontal view does not allow us to deduce any relationship or dependency 
among the plots of the studied funetions and the key images that we wish to select. 
When the subject moves towards the camera, its size increases, and therefore so does 
its height, its width and its area funetions. This process hides the possible variations 
due to the movement, and we can not establish a clear selection process. The lateral 
view is more meaningful. The distance between the subject and the camera is not 
significantly modified when walking. So, the variations of the functions considered 
above are due to the movement pattern itself If we observe the lateral view (Figure 
2a), we can clearly see that there are two minima and three maxima along the studied 
cycle. The height function is opposite, but the variations are smaller, and they are 
more error prone. The area function does not give any additional information (it 
depends on the width function). When we observe the walking cycle images, it is 
noticed that the swing phase corresponds to the minima of the width function on the 
lateral view and, similarly, the zones for the stance phase correspond to its maxima. 




a 




b 



Figure 2. The lateral view width (a) and maximum swing (b) functions 
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This first approach to the problem allows us to solve it by reducing it to the initial 
adjustment of the key images. Then, we want to extend this matehing to the sequence 
remainder using a temporal coherence criterion. By exploiting these special points of 
the cycle, we see that the subject in the stance phase has his/her both feet in contact 
with the ground, while in the second phase, a foot swings or moves forward to create 
a new step. By taking into account this characteristic we could define a funetion that 
has a maximum when the feet are in the stance phase and has a minimum when they 
are in the swing phase. This function computes the transition subject-background 
from left to right on a window (of N pixels) on the bounding box for the subject from 
the lateral view. This feature is also useful from the frontal view. The difference with 
respect to the lateral view resides in that, in this case, the maxima are associated to the 
swing phase and the minima to the stance phase. Another useful feature of the frontal 
view that allows us to discriminate the key images more clearly is the vertical pixel 
height difference between the two legs. In other words, the difference between the 
minimum value of eaeh leg is larger for the stance phase and smaller for the swing 
phase (see next examples). 



U U J 

Although this value may seem redundant, it is not, since its sign allows us to 
distinguish which supporting foot is in front. Taking into account all these parameters, 
a function was defined, so that combining them allows the discrimination of the 
points of maximum stance and swing. Its formula is shown below. For the transition 
function of the lateral and frontal view, we compute: 

±TL{i) ±TF{i) 

TL = TF = - 

n n n n 

Where the generic functions TF(i) and TL(i) compute the number of background- 
subject transitions for this co-ordinate inside the window defined in the frontal or 
lateral view, and N is the number of pixels of the selected window. 

A value considered adequate for N is 30 pixels, thus obtaining the following values 
for the lunctions defined above. 
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TLn(Maximun) ~ 2 {stance phase) TL (Minimun) ~ 1 {swing phase) 

n 

TF {Maximun) ~ 2 {swing phase) TF {Minimun) ~ 1 {stance phase) 

n n 



Similarly, the difference between the minimum of each leg is defined as: 

= {MinVal{PI) - MinVal(PD)) 

Where PI is the minimum co-ordinate calculated in the Y axis for the left leg and 
PD is the same for the right leg. This value can be positive or negative, its magnitude 
is used for the discrimination of stances and swings and its sign to know which foot is 
in front. Based on empirical discriminate tests, we define a suitable function that uses 
the three parameters described above and that allows for a correct classification of 
swings and stances: 

MaxBal 

{TLff ,TFff 

The function shown in Figure 2b generates two evident maxima that turn these 
images into candidates as key images for swings. By analogy and based on the same 
tests, we define a discrimination function for the stance maxima. It is reduced to the 
following formula: 

MaxApo = {TF + TL) x DP 

Legs 

With this function, we do not get as good a selection as in the case above, although 
we get a not too large sequence of candidate images that will be tested later on by 
more complex criteria. We should remark that this study is aimed at reducing the 
number of candidate images for key images. Then, it does not have to yield a unique 
solution. On the other hand, in the last case the difficulty in finding a discriminatory 
function can be related to the nature of the movement as it is slower in the stance 
phase than in the swing phase. This is the reason by which the variations of the 
studied parameters are also smaller, and therefore, there are more images near the 
maxima. Figure 3 represents the function of the maximum stances. In this figure, we 
can also see the minima correspond to the maxima of the swing function. 



DPpegs 




Figure 3. The maximum stance function 
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4, Application of the Criterion of Overlapped Surface 
Maximisation 

Based on the previous process, the following step implies defining the adjustment 
criterion using condition C2 (overlapping condition [see 5] for a further explanation) 
to find the function of maximum overlapping on each view for the model and for its 
corresponding image. This criterion is applied over all tree nodes, since the motion to 
study implies a displacement of the centre of mass and, therefore, a motion of almost 
all nodes in the structure. For each model instance, there must exist an image that 
implies a very high value of the overlapping function on both views. This function 
measures the percentage of overlapped area in the image plane for each node, 
dividing the resulting value by the number of nodes considered. Obviously, the 
application of this function reduces the candidate images from the previous process. 

Initially, we assign to each model its own image as a possible candidate, but the 
evaluation of the function is extended to the nearest neighbours. As a result, for each 
model, we must evaluate the overlapping function over the set of candidate images. 
This should be done for each view and the maximum of the function must be selected. 
The function to evaluate is 

IF/W 

MaxSol - — 

N,I,J N 

Where N is the number of considered model nodes, I is the instantiation of a model 
at time t, J is the corresponding candidate image, and represents the function that 
calculates the common area between the model and the object on the image plane. 

For the computed examples and based on the previous plotting, the models are 
evaluated over a reduced image set (3 or 4), while the number of nodes can vary from 
12 (only the most important whose areas are above 100 pixels) until considering all 
possible ones (24 nodes). 



Table 1. Overlapping functions 



M 


■ 


“/oOVERLAPPING 
Frontal View 


“/oOVERLAPPING 
View Lateral 


1 


1 


88,58 


95,31 


1 


2 


88,14 


77,97 


1 


3 


86,47 


60,97 


1 


4 


86,57 


60,97 



The results obtained in this way are correct, due to the fact that the selection of the 
maximum of the function establishes a correct matching most of the time. One 
example is given in Table 1, where it can be seen that the overlapping function is 
evidently a maximum of the image that coincides with the model, from the lateral 
view, while the values for the other images are much lower. In this table, M 
represents an instantiation of the biomechanical model, and I represents the candidate 
image. From the frontal view, this criterion is not useful, and is not applied because 
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the differences, in most of the cases, are very small. The remainder of the tests and 
results is discussed in the following section. 



5 Application of the Temporal Coherence Criterion 



Although previous results show that the function can discriminate the images in 
most cases, there are several situations where it could happen that the selection does 
not match the model. This means that, for a given model, its image does not have the 
same order in the sequence. To verify whether this matching is correct or not, the 
criterion of temporal coherence (C5 condition, see more details in [5]) is used over the 
sequence. This implies defining a window from the candidate image and evaluating 
the model instances over such a window under the assumption of the initial matching. 
This extension implies evaluating the following expression: 



MaxSol 



.J 



T,N,I,J 



I I///(F/(t),0 

l=\k=\ 



Where H ^ (x,t) represents the function of temporal coherence and T is a 



defined window that in the cases evaluated can vary from 3 to 9 images. If we 
evaluate those percentages, we see that the coherence is reduced (85 %) when an 
inadequate selection is made, thus the system selects the second candidate from the 
previous step, and applies the coherence getting a better value (around 96 %) than 
before. This process is repeated until an image is found with a coherence that is above 
an acceptable threshold (typically set to 95%). With this criterion, we make sure that 
the matching over the key images is the best, rejecting possible incorrect matching 
from the previous step. These matching are, most of the time, the results of the 
similarity of some images in the intervals where the movement is slower (stance 
phase). In Tables 2 and 3 we can see one case of wrong matching and its correction. 
The initial matching defines a link of model 8 with image 9. On applying the temporal 
coherence criterion, we see that the percentage drops below the acceptable threshold 
(95%). Therefore, another image is searched for (image 8) getting the optimal 
percentage. 



Table 2. 
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I 


%OVERLAP 

VF 


%OVERLAP 

VL 


8 


7 


91,61 


87,88 


8 


8 


92,79 


96,57 
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9 


92,42 


96,81 
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Table 3. 



M 


■ 


%OVER 
LAP VL(3) 


“/oOVERLA 
P VL(5) 


“/oOVERLA 
P VL(7) 


“/oOVERL 
AP VL(9) 


8 


9 


86,67 


88,77 


84,99 


85,80 


8 


8 


96,92 


96,48 


95,95 


95,96 



The temporal coherence is applied exclusively to the lateral view. As can be 
deduced, the overlapping percentage is not optimal but it does not drop notoriously 
when the window is enlarged. This is because the movement is cyclic and 
symmetrical and some mismatches are balanced. Anyway, in certain images of the 
temporal window the average percentages are low (around 60 %, and for some 
segments, 0%) with respect to the optimal (around 95%). 

The result of this process is a set of models matched with real images that satisfy 
the defined conditions on the adjustment criteria. From this matching, the adjustment 
is extended to the rest of the images in the sequence studied. 



6 Results 

The experiments have been performed using the C programming language on 
Silicon Graphics workstations. The first version was more oriented towards solving 
automatic matching. Now we have a semi-automatic version with a friendly interface. 
Due to the nature of the users of the system, we designed a friendly, graphical user 
interface. What is more, for portability purposes we converted our model into a 
VRML model so we can see the movement on a commercial browser. Here we 
present a reduced set of images with the corresponding matched model for a walking 
sequence. The lateral and frontal matched images are presented. Obviously, the 
recovered matching can be viewed from every point. 



7 Conclusions on the Walking Analysis Methodology and Future 
Work 

In this paper we have presented an interactive analysis and synthesis matching 
system for the detection of human walking motion. The analysis part of the system 
analyses the sequence of two-view input images to detect the 3D spatial information 
in each image frame. The process is a supervised matching unit that combines real 
and synthetic images. As the synthesis part allows for the playback of the detected 
motion of the human body, it is possible to use the results obtained for further 
analysis of the motion. We have presented a set of matching conditions and criteria 
for general purposes. 

For specific motor patterns we must define new strategies to reduce the space of 
search for the matching process. The complexity of a movement which is as 
sophisticated as human walking forces us to use the adjustment criterion and to focus 
on the need to use the knowledge and features of the model analyzed. An exhaustive 
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search (as in the 2-dimensional case of simple movements) is obviously impossible, 
thus an empirical model is introduced that transforms a set of infinite combinations to 
one of N previously set positions, neither does using 3D-graph matching 
approximation solve the problem [12] in complex movements. Our system is less 
robust than those based on learned temporal models [15, 16] but the complexity is 
also reduced and, for the movements studied, good matching is obtained. 

The next goal is to establish a matching between these N positions of the model 
and the real images based on its parameters that satisfy a set of restrictions. We aim to 
integrate a balance criteria restriction in order to increase the rate of convergence to 
the solution. The results obtained are very encouraging, the system is able to carry out 
the matching on all of the proposed cases. Nevertheless, it does not actually allow a 
variation of the rotation parameters of each articulation. This would be useful to 
define a new adaptive system that could adjust to the changes in the rhythms of 
walking and even to transitions to other movements (walking-running, or vice versa). 
We are now evaluating these changes and how the matching functions must be 
modified. We are also working on a possible integration of an analysis-synthesis 
system that uses physical criteria to extract human motion but with a feedback control 
system that uses image processing and includes contextual knowledge in order to 
reduce the search space. 

Either way, the system can easily detect a variation if we consider the ability to 
control the overlapping function. We always require a minimum level of condition 
satisfaction. What we also wonder is what the criterion to keep such a level could be? 
Evidently, our results are conditioned to the situation in which segmented images do 
not have large errors that could notoriously affect the evaluation functions. The 
criteria applied are also exclusive for the selected views (lateral and frontal), so a 
change of camera position implies a search for new discriminatory parameters. We 
must also be careful with occlusion and overlapping of different parts of the body. We 
have already developed some basic criteria to solve specific cases of overlapping and 
occlusions but we consider them to be out of the scope of this paper. We are also 
working with multiple views (3 b/n cameras and 2 colour cameras) and robust 
estimation using Kalman filtering for tracking blobs as in [14] but using full 
information at a high level. 

On the other hand, it is necessary to somehow validate the results obtained 
manually, to guarantee their correction. The validation has been done by comparing 
the results obtained with other works. Plots for the most important nodes analyzed 
agree with the works mentioned as all of the values are within acceptable limits. 
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Figure 5. Walking part 
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Abstract. This paper presents a method of estimating both 3-D shapes 
and moving poses of an articulated object from a monocular image se- 
quence. Instead of using direct depth data, prior loose knowledge about 
the object, such as possible ranges of joint angles, lengths or widths 
of parts, and some relationships between them, are referred as system 
constraints. This paper first points out that the estimate by Kalman 
filter essentially converge to a wrong state for non-linear unobservable 
systems. Thus the paper proposes an alternative method based on a 
set-membership-based estimation including dynamics. The method lim- 
its the depth ambiguity by considering loose constraint knowledge rep- 
resented as inequalities and provides the shape recovery of articulated 
objects. Effectiveness of the framework is shown by experiments. 



1 Introduction 

In general, the 3-D shape of a non-rigid object cannot be recovered with only 
one camera even if an image sequence is given. For a certain class of objects 
like human bodies, however, depth information can be recovered from a monoc- 
ular camera data if adequate prior knowledge about the shape and structure 
is available. For example, we can build a structure model of a human body by 
assuming that its joint motions are related with each other and there are rough 
correlations between the sizes of body parts. Given such constraints, depth es- 
timation can be broken down into a least squares problem [1][2]. Particularly, 
Kalman filter and its variation for non-linear systems (Extended Kalman filter) 
are considered to give estimates with good accuracy and popularly applied to 
estimate an object’s shape or pose parameters. If the constraints are rough to 
be represented by inequalities, this method cannot be applied because the origi- 
nal EKF cannot deal with the inequalities. Shimada et.al [3] recovered the joint 
angles and the finger lengths from a monocular image sequence using an EKF 
variation handling the inequalities by distribution truncation. 

In addition, Kalman filter has an essential problem. Although it can give 
correct estimates only for linear systems and non-linear observable systems, its 
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Fig.l. Incremental update of param- Fig. 2. Possibility limitation by con- 

eter set straint knowledge 



estimate converges to a wrong state due to approximation error in linearization 
for non-linear unobservable systems. As an alternative way, the paper proposes 
an estimation method based on set-membership estimation. In EKF, the ambigu- 
ity of the estimate is described by the estimate covariance In the set-membership 
methods, a possible parameter set described by its boundary is used for repre- 
senting the estimate ambiguity, which is represented by the estimate covariance 
in EKF. In this paper, the combination of the ellipsoidal description [4] [.5] and a 
rectangular description is proposed. In order to estimate time-varying parame- 
ters, the proposed framework introduces an updating scheme by dynamics. Since 
this method does not calculates probabilistic integration but an intersection set 
of possible parameters, it can provide more accurate estimation than EKF be- 
cause it can avoid the accumulation of linearization error. For this ability, loose 
constraint knowledge such as possible ranges of lengths, widths or joint angles 
are available to estimate the shape and pose of articulated objects. While the 
set-membership-based approach has these merits, it is well-known that it has a 
drawback of the weakness against outliers. Here we concentrate on describing 
the integration scheme at each time-frame supposing the existence of a outlier 
rejection method. 

In the following sections, the basic idea is first explained and the details 
are described later. Finally effectiveness of the method is shown by estimation 
results for an articulated object. 

2 Basic Idea 

Although monocular imaging systems have unobservability of depth in shape 
estimation of non-rigid objects, they can estimate depth information if prior 
knowledge is additionally available. For example, we can recover a human body 
by approximating it as an articulated object which consists of rigid body parts 
linked each other and by assuming the following constraints. 
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(a) Shape parameters (lengths and widths) are constant over an image sequence. 

(b) Pose parameters (joint angles and orientation) change continuously. 

(c) Each parameter is within a certain range and has relations with other pa- 
rameters. 

Note that these constraints include not only equations but also loose constraints 
described as inequalities. 

Fig. 1 summarizes the basic idea of parameter estimation with these con- 
straints under “unobservable” systems^. A vector space of parameters describing 
shape and pose is first considered and an initial possible parameter set fit-i is 
supposed. Then a predicted set is generated by shifting and diffusing the 
initial set based on the object’s dynamics and its noise. When an observation 
data is obtained from each image, a parameter set H satisfying the observation 
is considered. Another set F satisfying the constraint knowledge is considered as 
well. Then the updated possible parameter set fit is obtained as an intersection 
of the three parameter sets. This process is repeated for each image observa- 
tion. The important point is that the constraint knowledge has an effect to limit 
unobservable modes as illustrated in Fig. 2. Until the parameter set intersects 
with any boundary of the constraints, the system is still unobservable and the 
ambiguity remains. Once it intersects with a constraint boundary, however, the 
possible parameter set is limited to only the inside of the boundary. If any pa- 
rameters are supposed to be constant, such as lengths or widths of rigid parts, 
the ranges of these parameters should monotonously decrease. As the result of 
iterative intersection with boundaries during sequential observations, ranges of 
any parameters including unobservable modes can be limited. 



3 Limit of EKF Estimation 

3.1 Non-linear Unobservable System 

The idea described in the previous section is naturally implemented in a state es- 
timation framework of Kalman filter or its non-linear version (Extended Kalman 
filter, EKF). Shimada et al. utilized EKF to implement the idea [3]. 

Suppose the transition and observation formulas of a system represented as 

Xt+i = Axt + But ( 1 ) 

yt = h{xt) + wt ( 2 ) 

where is an observation vector. Ut and Wt are white noises with zero mean 
and variances U, W. The components of U for time-invariant shape parameters 
are zeros. If the observation function h(-) is non-linear, the current state Xt and 
variance Pt are approximately estimated by EKF : 

^ The term “unobservability” is used in the control theory. It means that there are 
states which cannot be discriminated by any observation sequences. 
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X 



X 



(a) Linear System 



(b) Non -linear System 



Fig. 3. Linearization error for non-linear unobservable systems 



xt = xt + Kt{Vt - h{xt)} 

Pt = {I- Kth'{xt)){APt-iA^ + BUB^) 



( 3 ) 

( 4 ) 



where Kt is a, Kalman gain matrix and Xt = Axt-i- 

EKF is recognized as the estimator to achieve good accuracy. It is correct for 
linear KF, but not for FKF. Actually the EKF estimate for non-linear unobserv- 
able systems essentially converges to a wrong state under a certain condition. 
Fig. 3 illustrates the reason. In linear systems, the direction of a hyper-plane sat- 
isfying a certain observation y is invariant. Therefore the co- variance ellipsoid 
decreases only in the observable directions and ambiguity of unobservable modes 
is properly preserved. In contrast, the direction of the hyper-plane depends on 
the linearizing point in non-linear systems. Therefore the variance ellipsoid is 
destined to shrink in any directions even if the system is actually unobservable. 
When all parameters are time- varying, this is not a serious problem because the 
erroneous shrinkage is swallowed up by the variance increase in any directions at 
each time step due to the system transition noise. However, it is fatal for systems 
including time-invariant parameters such as a length or width of rigid objects. 
When constant observations only perturbed by noise are obtained, the EKF esti- 
mate converges into a wrong state as shown in Fig. 4. The vertical bars mean the 
range of 2 standard deviation. Once the wrong conversion occurs, the estimate 
is difficult to modify due to too small variance even if observations containing 
new information are obtained. Then tracking fails as a result. Fig. 5 shows such 
a failure example for a 3-link arm in 2-D in which the black circle denotes the 
fixed origin of the arm and the open circles corresponds to the link-joints or the 
most proximal tip of the arm. 

3.2 Inequality Constraints 

The constraints (b) and (c) introduced in section 2 include loose constraints 
represented as inequalities. Although the equation constraints can be treated as 
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Fig. 4. Wrong conversion 




1 


1 

( 

1 

- u 


1 


1 


*7 







(b) EKF estimate 



Fig. 5. Wrong Estimate by EKF 



an observation with a zero variance in the EKF framework, inequalities cannot 
be dealt in such a simple way. Hel-Or el.al[6] modified the inequality constraints 
to quadratic equations with slack variables to linearize them. The linearized 
constraints, however, are quite different from the original ones. Another way is 
to introduce the constraints as an initial distribution. It is also inappropriate 
because the effect of the initial distribution decreases at every frame. 

4 Possibility Reduction Based on Set-Membership 

4.1 Ellipsoidal Boundary Description 

The set-membership methods are known as a way to estimate an unknown sys- 
tem parameters from a sequence of input signals and the corresponding output 
of the system under bounded signal errors. The methods represent a possible 
parameter set as intervals of each parameter [7], a polygon [8] or an ellipsoid [4] in 
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a bounded observation or 




Fig. 6. Update of ellipsoidal set-membership 

a high-dimensional vector space. Fogel et al. [-5] formulated a set updating with 
an ellipsoidal initial set and a linear observation with bounded noise. 

Suppose an initial ellipsoid fit described as 



{xt-qtfPt^{xt-qt)<^ (5) 

and a parameter set H satisfying a noise-bounded observation as 

\mjxt - pi\ < c^. ( 6 ) 

The latter means the region between two hyper-planes. A parameter set F sat- 
isfying an inequality constraint is described as well. A certain ellipsoid flt+i 
covering whole the intersection set flt+i between fit and H{F) (see Fig. 6) can 
be represented with A > 0 as 



{xt - qtfPt ^{xt - qt) + \{mjxt - pif < 1 -b Xcl (7) 

which can be reformed to the form of Eq. 5. Using a criterion minimizing trPt+i, 
A achieving the smallest ellipsoid is the real root of 

/3iA^ -l- /?2A^ -I- P 3 X P 4 = 0 (8) 

where 

Pi = G^cUpG-j), 

P2 = SGcf{pG-"f), 

Ps = GpiScf - e^) - 7{2(c^ - e) + G}, 

/?4 = - e^) - 7, 

p = trPt, 

7 = mJPlm,, 

G = mJPtm^, 
e = Pi- mJq^. 



( 9 ) 
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It can be proved that Eq. 8 always has one real root and two imaginary roots. 
In addition, if the real root is negative, the minimal trP(_|_i in the domain of 
A > 0 is achieved at A = 0 (the details of the proof are described in [-5]). Finally, 
the updated ellipsoid is described in the form of Eq. 5 by 



where 



Qt+i = Qt + XV rn,e, 
Pt+i = zY 



Y = Pt-X 



Ptm.mfPj 
1 + AG 



z = 1 + Xc^ 



Xs^ 

1 + AG' 



( 10 ) 



( 11 ) 



Since this method does not calculate a probability variance but a boundary of 
the intersection, the parameter set should keep its size even if almost the same 
observations are repeatedly obtained. 



4.2 Extension to Dynamic Systems 

Most conventional set-membership methods contained no dynamic mechanism 
[9] or treated dynamics for low dimensional systems using a simple interval de- 
scription [7]. While update by dynamics can be easily introduced into such inter- 
val descriptions, it is unable to represent correlations between each parameter. 
In addition, the set description is inaccurate. The ellipsoidal set-membership 
can be easily updated in high dimensional systems and has the ability to de- 
scribe correlations. Thus we introduce a dynamics and prediction phase into the 
ellipsoid-based method. 

Suppose a dynamics of the system is given as Eq. 1 and the system noise Ut 
is bounded as 



ufU-'^Ut<l. ( 12 ) 

In the {xt, Ut) space, a certain ellipsoid including all of the parameters satisfying 
Eqs. 5 and 12 can be represented as 

{xt - qt)'^Pt^{xt - Qt) + XuJU~^Ut < 1 -I- A (13) 

where A > 0. Combining this and Eq.l, 

Qt+i = Mt (14) 

Pt+i = {l + X)APtA^ + {l + ^)BUB'^ (15) 



is obtained. See Appendix for details. Here, using a criterion which minimizing 
trPt-K, the smallest ellipsoid is given by 



A = 



tv{BUB'^) 

tv{APtA^y 



(16) 
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After this predictional update, the observations and constraints are integrated 
in the way described in section 4.1. 



4.3 Maximum and Minimum Bounds 

In updating by Eq. 11, the parameter set can be represented as an small ellipsoid 
if Ci is small enough compared to the size of the original ellipsoid. Otherwise, the 
updated ellipsoid tends to include a large region outside the true intersection . 
Fogel et al.[5] added a pre-process: when any of the two hyper-plane boundaries 
of Eq. 6 does not intersect with the ellipsoid to update, pi and Ci are modified 
so that the non-intersecting hyper-plane is tangent to the ellipsoid. However, if 
the ellipsoid to update highly sticks out of the hyper-plane boundary, the new 
ellipsoid simply updated by Eq. 11 is almost the same as the original ellipsoid. 

In our method, the maximum and minimum values of each component of x 
are calculated by controlling A in Eq. 7. The maximum and minimum in direction 
n(|n| = 1) are obtained by substituting a new criterion minimizing nFPt+in 
for minimizing trf’t-i-i- A achieving that criterion is obtained by solving Eq. 8 
where 

/3i = G^a{GK - L^) 

/?2 = iGcj{GK-L^) 

Ps = GK{3c^ - e^) -h L\2e'^ - 2cf - G) 
p4 = (c? - e‘^)K - 
K = hF Ptn 

L = mJPtmi. (17) 

The maximum Xmax and minimum are obtained by 



Xmin{n) = n^s - y/n'^Qn (18) 

Xmax{n) = n^s + \J 71^ Qn (19) 

s = q^ + XY rriie (20) 

Q = zY (21) 

where z and Y are given by Eq. 11. If Xmax and Xmin for any n can be calcu- 
lated, a convex hull of the true intersection set Pt+i is obtained. Since systems 
for articulated objects have much high dimensions, each coordinate axis of the 
parameter space is used as a representative. Whenever an observation is ob- 
tained and the ellipsoid is updated in time t, x^^^^ and Xmax for each parameter 
are calculated and then max (a^min > ^ xaiii(xmax , Xmax^ ) are preserved. 

At the final stage of each update, a constraint in the form of Eq. 6 is made 
from and Xmax and the ellipsoid is updated by the constraint in the way 
of Eq. 11. 
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Linearized Bound 



(a) Linearization 



Original Bound 
(b) Segmentation 



Fig. 7. Compensation for non-linearity of bound 



4.4 Compensation for Non-linearity of Bounds 

In the previous section, we summarize our set-membership-based estimation sup- 
posing the observation and constraint formulas are linear as Eq. 6. Since they are 
actually highly non-linear like Eq. 2, they need to be approximated as a linear 
form: 



Since this approximation includes linearization error, p and c should be deter- 
mined so that the parameter set satisfying Eq. 6 includes one satisfying both 
Eq. 22 and noise bound 



First a linearizing point is found by iterative solving of Eq. 22 with an initial 
solution Xt- Then Eq. 2 is linearized and decomposed into each component yi. 
The decomposed h[ is determined as rrii in Eq. 6. Next hi is sampled in the 
predicted ellipsoid and checked whether Wi = hi — yi is within its bounds of 
Eq. 23. Pi and Ci are determined from the passed samples as shown in Fig 7(a) . 
If the passed samples are divided in some divisions like in Fig 7(b) , linearization 
and sampling are started again for each division and then multiple segmented 
bounds are obtained. In such cases, multiple updated ellipsoids are generated. 
Each corresponds to a candidate of interpretation. 



y^ = h'{xt){xt - Xt) -b h{xt) + Wf 



( 22 ) 



wtW ^Wt < 1 . 



(23) 
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Fig. 8. Representation of an articulated object 



Table 1. Object constraints 
(tight set) 



pose 

constraints 


1^2 — Oi] < 7r/6rad, 
Orad < 62,93 < 

7r/2rad, 


shape 


0 < ri — r2 < 15, 


constraints 


|c2 - ral < 15, 




65 < ri < 90, 




50 < ra < 75, 




45 < rg < 65 



Table 2. Object constraints 
(loose set) 



pose 

constraints 


|t*2 — Sal < 7r/6rad, 
Orad < 62,63 < 

7r/2rad, 


shape 


0 < ri — ra < 30, 


constraints 


ra - rg < 30, 




52.5 < ri < 102.5, 




37.5 < ra < 87.5, 




20 < rg < 80 



5 Experimental Result 

5.1 Experimental Setup 

We have performed computer simulation experiments to prove the validity of 
the method. For simplicity, we use a 2-D link object as in Fig. 8 to estimate its 
lengths r and joint angles 9. It has a three joints rotating va. & x — y plane and its 
dynamics is unknown - transition matrix A is assumed to be identity. However, 
all joint angles and the differences between the most proximal joint angle and the 
second proximal one are constrained within a certain range. Possible ranges of 
the link lengths are also assumed. As an observation system, only 1-D position x 
for each joint is observed and the depth y is not available. The transition and 
observation noise is supposed to be bounded. 

5.2 Result of Shape Recovery and Accuracy 

In order to verify the estimation accuracy, we show estimation results for the 
link object. First, the tight constraint set shown in Tab. 1 is applied. The result 
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(a) 98th frame (cor- 
rect) 




timate) 




(b) 127th frame 

(correct) 




(e) 127th frame (es- 
timate) 



Fig. 9. Multiple estimates 




(c) 138th frame 

(correct) 




(f) 138th frame (es- 
timate) 



is shown in Fig. 10 and 11. Then another constraint set shown in Tab. 2 twice 
times looser than Tab. 1 is applied. The result is shown in Fig. 12. With the tight 
constraints, the estimated range of r (shown by each vertical bar) is getting small 
in early time. With the loose constraints in contrast, the range is getting small 
more slowly but finally converges to the correct value (shown by a straight line). 
The final range is almost the same as the tight constraints. In Fig. 10, there are 
cases that more than one estimates (shown as circles) are obtained. This means 
that there are multiple symmetric interpretations due to depth ambiguity (see 
Figs. 9(d)-(f)). 



5.3 Dependency on Initial Estimate 

Next, in order to verify dependency of the estimate on initial estimates, we show 
estimation results for the same observation sequence starting from two different 
initial estimates in Figs. 11 and 13. Regardless of initial estimates, each shape 
estimate of r\ finally converges to almost the same correct value. This means 
that there is no remarkable dependency on initial estimates. 



5.4 Identification of Different Shape 

In order to verify the ability to identify different shapes, we show estimation 
results for two objects whose link lengths differ from each other. The estimation 
for each object starts from the same initial estimate and with the same con- 
straint set. As the experimental result. Fig. 14 shows the correct, initial, and 
finally estimated objects extending all joints in order to compare lengths of the 
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Fig. 10. Estimate of joint angle 62 
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Fig. 11. Estimates of the 1st link 
length ri 
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Fig.l2. Estimates of the 1st link 
length ri with looser constraints 
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Fig.13. Estimates of the 1st link 
length ri (started from another 
initial estimate) 





corresponding links. Concludingly, Fig. 14 tells that all link lengths appropri- 
ately converge to the correct values for both objects. This means that the system 
correctly identified the two different shapes. 

6 Conclusion and Discussion 

This paper has proposed a method which simultaneously estimates a 3-D shape 
and moving pose of an articulated object from a monocular image sequence. 

This paper pointed out a Kalman filter estimate essentially converges to 
wrong state for non-linear unobservable systems. For an alternative, the paper 
has proposed a set-membership-based method including dynamics. The method 
limits the depth ambiguity by considering inequality constraints such as possible 
ranges of joint angles, lengths or widths of parts. Using these constraints, it pro- 
vides shape recovery of articulated objects under monocular systems. Effective- 
ness of the framework has been shown by estimation results for an articulated 
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Initial Estimate: 

Correct Links: 

I I I 

Current Estimated Links(l): 

(a) Final Estimated Shape 1 



Initial Estimate: 

Correct Links: 

I I I 

Current Estimated Links(l): 

(b) Final Estimated Shape 2 



Fig. 14. Object identification 



object. Although the system shown in the experiment section adopts an only 
two-dimensional link object whose joint points can be observed, the proposed 
method can be applied to cases of three-dimensional objects. The method can 
use any other observations such as object axes or contours extracted by edge 
detection or active contour method. A weak point of set-membership methods 
is an outlier problem. Since bounded noises are supposed, the intersection set 
may become empty by an outlier noise over the pre-estimated noise bounds. It 
is a future work to detect and remove outliers. 

Appendix: Predictive Update by Dynamics 

Here, define a vector Zt as 



zt = {(xt - qtf . (24) 

With the above equation and Eq. 13, the following relation 

z^^Qzt<\ where Q = (25) 

is derived and Eqs. 1 and 24 conclude 

Xt+i = A{xt - q>t) -t- But + = {A B) Zt + Aq^. (26) 

In general, the following relations 

z'^Q~^z<l, x = Cz + d (27) 



yield 

(a;-d)^(C'gC^)"^(a;-d) < 1 (28) 

if m < n and rank C equals to m where C is a to times n matrix (this condition 
means that {Cz} spans a linear space of to dimensions). Therefore Eq. 13 is 
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rewritten as the form of Eq. 5 and the followings are derived by substituting 
C = (A B), d = Aq^ and x = Xt+i into Eq. 28: 

BUB^^{xt+i-Aq^)<l{29) 

By comparison of Eq. 29 with Eq.5, Eqs. 14 and 15 are derived. 



(®t+i ~ 1 X)AP tA^ +11 + 



A 
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Abstract. Recognizing activities in image sequences is an open problem 
in computer vision. In this paper we present a method to extract the most 
significant frames from an activity sequence. We name these frames as 
the keyframes. Moreover, we describe a pre-processing stage in order to 
build a robust representation for different human movements. Using this 
representation, we build an activity eigenspace that is used to obtain a 
probability measure. We use this measure to develop a method to select 
the activity keyframes automatically. 



1 Introduction 

Humans use a lot of languages to communication. One of them is body language 
which can give different types of information such as emotions and warnings. 
The basis of body language are the human movements and activities. The parts 
of the human body can move in quite independent ways. Human movements, 
such as walking, running, kicking, etc., are very constrained by factors including 
motion symmetries or dynamics. A sequence of human movements is a human 
activity. 

Recognizing human activities is not a new task in computer vision [1,4,2]. 
There are two basics approaches: physics or appearance based. The physics based 
approach might involve locating and tracking the limbs and extremities of the 
body under control of a mechanism that optimizes the tracking with respect to 
known physical contraints embedded in a body model. This appoach requires a 
controlled environment due to the difficulties of identifying body parts in natural 
video imagery. Also, it is difficult to develop efficient computational methods for 
modeling and enforcing such physical constraints. 

The alternative approach is to use appearance based models for human mo- 
tion. The main challenges to such models are viewpoint dependence, dealing with 
appearance variability (changes in clothing, shadows, body size and proportions 
between individuals), recognition in the presence of occlusion, etc. These chal- 
lenges involves that the domain of such methods is still very constrained, i.e., 
stationary or controlled background. Our work is based on this approach. 

Consider the frame shown in Fig. 1. Even with only one frame you can rec- 
ognize the activity as someone running. Such capability argues for recognizing 
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Fig. 1. Eadweard Muybridge, The Human Figure in Motion, frame extracted 
from Plate 23: Man running. [8] 



an activity only by selecting few frames from the entire sequence. These frames 
correspond to the most distinctive movements of the activity. We will call them 
the keyframes of an activity. This paper addresses the problem of automatic 
selection of these characteristic frames. With these frames, we have a represen- 
tation of the entire activity. By detecting these keyframes in an on-line sequence, 
we should be able to describe the activity which is being performed. 

The method presented here is based on appearence methods. Therefore, we 
need to use a description of human body that is invariant to changes in environ- 
ment and in human clothing. In section 2 we describe this process. In section 3, 
we present an automatic algorithm for the extraction of keyframes following a 
probabilistic scheme. Experiments are showed in order to validate our method. 
Lastly, conclusions and further work will also be presented. 



2 Human Body Description 

Before extracting the keyframes from an activity sequence, some pre-processing 
operations are required. An activity recognition system involves, first of all, the 
segmentation of the actors in order to detect them. To achieve such a segmen- 
tation, we have followed the method described by Oliver et ah, the EigenBaek- 
ground [9]. 

This method of background subtraction copes with lighting changes if ob- 
jects of interest are small because it is possible to adapt the model during the 
extraction process. In our case, we do not adapt the model after the learning 
stage due to size of the actors. Therefore, it is necessary a previous stage in order 
to correct the effects of lighting. The global variation of lighting in a sequence is 
found and removed from the input images. 

The description details of the lighting correction process can be found 
in [10]. Once a set of background images {B^{x,y),B'^{x,y),- ■ ■ ,B^(x,y)} 
has been acquired, the reference background image M{x,y) = 
{Mn{x,y), Mc{x,y), MB{x,y)) is computed as its average. The effects of 
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illumination changes are assumed to be isotropic and to act as an indepen- 
dent contrast factor for each color channel. That is to say, a background 
image B^{x,y) can be expressed as: 

fkn 0 0 \ 

B\x,y) = M{x,y) \ 0 kc 0 \+e{x,y) (1) 

Vo Q kBj 

where e(a;, y) = (efi(cc, y), eg{x, y), £_b(x, y)) is a zero mean noise term and k^, kc 
and ks are the parameters that must be determined in order to correct the 
illumination. An estimation of the k factors (fc/{, kc and ks) could be found by 
minimum squared errors. 

In order to perform the background modeling, we compute the mean back- 
ground image, fib, and its covariance matrix, Eb- After applying dimensionality 
reduction using Principal Component Analysis (PC A), the background model 
is represented by a set of eigenvectors corresponding to the biggest eigenvalues. 
Small movements of the camera and the background are well described in this 
eigenspace model, so a more robust segmentation is achieved. 

When an image from the activity sequence is processed, after the ligthting 
correction is applied, it is projected onto the eigenspace of the background model. 
By reconstructing the image, pixels corresponding to foreground regions will 
dissappear, due to the fact that they have no contribution to the background 
model. After, subtracting the reconstructed image from the original image gives 
the set of pixels corresponding to foreground objects. A thresholding operation 
is performed in order to generate clean silhouettes of the foreground objects. We 
can see the results of the segmentation process in Fig. 2. 



T 



Fig. 2. Example of the segmentation process. Left: Original frame. Right: Actor 
segmented 




In other related works such as [11,5,3] the blob-like representation is sufi- 
cient to achieve activity recognition. However, they are based on 2D models of 
humans and simple blob statistics. These features are not enough to extract the 
most distinctive frames from an activity. Kovacs et al. suggested a concise rep- 
resentation of visual shape based on skeletal representations [6]. In their work, 
they computed a function based on an equidistance metric. Thus, a grey-level 
non-uniform skeleton is found. Peaks represent the largest amount of edge infor- 
mation. Based on their work, we apply the transform map onto the segmented 
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image. This representation is local, compact and suitable to perform multi-scale 
analysis. Also it is highly affected by changes in shape. This is a very desiderable 
characteristic because we seek, in fact, changes in shape in order to select the 
most distinctive ones. 





Fig. 3. Human description used for the keyframe extraction method 



For our purpose, this representation has also another great benefit : it is 
independent of the appearance. That is, it doesn’t mind, for example, if actors 
wear different clothes. We don’t use the pixel values of the original frame. Ap- 
plying the distance map involves that only the shape of the actor is relevant 
to perform the keyframe analysis. In Fig. 3 it is shown the final image of the 
complete preprocessing stage. 



3 The Method for Keyframes Selection 



In a sequence, the most distinctive frames of a human activity, the keyframes, 
correspond to the less probable ones. That means, in an activity we can found 
quite a few repetitive frames. The less repetitive frames corresponds to extreme 
motions, and they are those which distinguish between different activities. Our 
goal is to find these frames. 

Using different example activity sequences, we build an eigenspace in order 
to represent an activity as a set of points, each one corresponding to a frame of 
the sequences. The method for human description described before is applied to 
obtain the movement description for each frame of the sequence. The descriptions 
extracted from different image sequences of the same activity are used to model 
the activity eigenspace. 

Using the activity eigenspace we can use a likelihood measure, described by 
Mogghaddam and Pentland in [7] . This method uses a Gaussian density in order 
to model the eigenspace. Therefore, it is possible to compute the probability that 
a pattern, x, i.e. a frame, belongs to the eigenspace. Moreover, if the dimension- 
ality reduction is applied, the authors present a distance definition as a sufficent 
statistic for characterizing the likelihood. 



M 



d{^) = 



c2(x) 



i=l 



A,; 



( 2 ) 
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where y are the frame projection in the activity eigenspace, \t are the eigenval- 
ues, and M is the number of dimensions of the reduced eigenspace. e^(x) is the 
distance from feature space (DFFS) and it is computed as 

N M 

Vi =\\^\\'^ 

i—1 

where x is the frame x (image pixels in lexicographic order) mean normalized 
(by subtracting the mean), and N is the original dimension of the image space. 
Finally, p is a weight that is found by minimizing a cost function (see [7]), 



P = 



1 

N -M 



N 



Y. 

z=M+l 



(4) 



Using d(x) we have a distance measurement for each frame. Applying a temporal 
ordering to each distance of an activity sequence we obtain a graphic distance vs 
time that is used to find the keyframes. For example, in Fig. 4 it can be seen the 
graphic for an activity sequence. Peaks of Fig. 4 correspond to less likely frames. 
These less likely frames can be thought as the frames that contains more infor- 
mation (the entropy is the inverse of probability). Therefore, these frames are 
considered as the keyframes of the activity sequence. Moreover, temporal order- 
ing makes suitable to use a difference operator in order to obtain the keyframes. 
The graphic of Fig. 4 has been calculated from the sequence shows in Fig. 5. 

The method interpretation is that extreme motions correspond to the most 
distinctive frames from an activity sequence. The rest of movements are repeti- 
tive and they occur to arrive to the extreme movement. From a probability point 
of view, extreme movements are the less probable in an entire activity sequence. 
This fact is simply a consequence of the number of frames where the movement 
appears (less than the repetitive movements). 



4 The Experimental Results 

The complete process to obtain the keyframes for an activity are: 

1. To obtain a sufficient number of example sequences for an activity. 

2. To normalize the activity sequences using the preprocessing stage to compute 
a robust description for the human movements. 

3. To build the activity eigenspace applying PC A to the normalized sequence. 

4. To compute the frame distances for each sequence. 

5. To use a voting scheme to select the keyframes. 

The voting scheme consists of obtaining the keyframe candidates for each 
activity sequence of the sequence data set used to build the activity eigenspace. 
Candidates that appear in most sequences are selected as keyframes. In Fig. 6 
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Fig. 4. This graphic shows the distance values, d, for each activity frame in 
temporal order 
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Fig. 5. Sequence used to draw the graphic of Fig. 4. The frames are ordered in 
lexicographic order. From Fig. 4 the maximum values correspond to frames 9, 
14, 19 and 25th. These are the keyframe candidates 
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Fig. 6. Keyframe candidates : each column represents the keyframe candidates 
for each of five sequences of the same activity 



we show the candidates of different sequences corresponding to the same tennis 
activity. 

In order to evaluate the algorithm we took sequences as shown in Fig. 5. We 
took four different tennis activities, and five different sequences for each activity. 
Fig. 7, 8, 9 and 10 show the keyframes generated for each activity. One can easily 
guess which kind of activity is being represented by each set of keyframes. 
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Fig. 7. Keyframe set for first tennis activity 



5 Conclusions and Further Work 

We have defined keyframes like the most representative frames from an activ- 
ity sequence. To obtain these frames, we preprocess the image frames to obtain 
a suitable representation for human movements. Using this representation, we 
build an activity eigenspace and we use a likelihood measure to select automat- 
ically the keyframes for an activity. 
























180 



X. Varona et al. 



i-tli.Ji 



Fig. 8. Keyframe set for 
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second tennis activity 





Fig. 9. Keyframe set for third tennis activity 





Fig. 10. Keyframe set for fourth tennis activity 



The natural extension of our work is test the activity keyframes in real se- 
quences of simple human movements. Our method can be used in an on-line 
manner to known the human activities when a keyframe appears. 
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